Thread: Proposal: "Causal reads" mode for load balancing reads without stale data
Hi hackers,
Many sites use hot standby servers to spread read-heavy workloads over more hardware, or at least would like to. This works well today if your application can tolerate some time lag on standbys. The problem is that there is no guarantee of when a particular commit will become visible for clients connected to standbys. The existing synchronous commit feature is no help here because it guarantees only that the WAL has been flushed on another server before commit returns. It says nothing about whether it has been applied or whether it has been applied on the standby that you happen to be talking to.
A while ago I posted a small patch[1] to allow synchronous_commit to wait for remote apply on the current synchronous standby, but (as Simon Riggs rightly pointed out in that thread) that part isn't the main problem. It seems to me that the main problem for a practical 'writer waits' system is how to support a dynamic set of servers, gracefully tolerating failures and timing out laggards, while also providing a strong guarantee during any such transitions. Now I would like to propose something to do that, and share a proof-of-concept patch.
=== PROPOSAL ===
The working name for the proposed feature is "causal reads", because it provides a form of "causal consistency"[2] (and "read-your-writes" consistency) no matter which server the client is connected to. There is a similar feature by the same name in another product (albeit implemented differently -- 'reader waits'; more about that later). I'm not wedded to the name.
The feature allows arbitrary read-only transactions to be run on any hot standby, with a specific guarantee about the visibility of preceding transactions. The guarantee is that if you set a new GUC "causal_reads = on" in any pair of consecutive transactions (tx1, tx2) where tx2 begins after tx1 successfully returns, then tx2 will either see tx1 or fail with a new error "standby is not available for causal reads", no matter which server it runs on. A discovery mechanism is also provided, giving an instantaneous snapshot of the set of standbys that are currently available for causal reads (ie won't raise the error), in the form of a new column in pg_stat_replication.
For example, a web server might run tx1 to insert a new row representing a message in a discussion forum on the primary server, and then send the user to another web page that runs tx2 to load all messages in the forum on an arbitrary hot standby server. If causal_reads = on in both tx1 and tx2 (for example, because it's on globally), then tx2 is guaranteed to see the new post, or get a (hopefully rare) error telling the client to retry on another server.
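To make that concrete, here is a minimal sketch in SQL (the table name and columns are invented for illustration; causal_reads is the proposed GUC):

  -- tx1, executed on the primary when the user posts:
  SET causal_reads = on;
  BEGIN;
  INSERT INTO forum_messages (forum_id, author, body) VALUES (42, 'alice', 'hello');
  COMMIT;  -- does not return until every 'available' standby has applied it (or been dropped)

  -- tx2, executed afterwards on whichever hot standby the load balancer picks:
  SET causal_reads = on;
  BEGIN;
  SELECT author, body FROM forum_messages WHERE forum_id = 42;
  -- guaranteed to see the new row, or to fail with:
  --   ERROR:  standby is not available for causal reads
  COMMIT;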
Very briefly, the approach is:
1. The primary tracks apply lag on each standby (including between commits).
2. The primary deems standbys 'available' for causal reads if they are applying WAL and replying to keepalives fast enough, and periodically sends the standby an authorization to consider itself available for causal reads until a time in the near future.
3. Commit on the primary with "causal_reads = on" waits for all 'available' standbys either to apply the commit record, or to cease to be 'available' and begin raising the error if they are still alive (because their authorizations have expired).
4. Standbys can start causal reads transactions only while they have an authorization with an expiry time in the future; otherwise they raise an error when an initial snapshot is taken.
In a follow-up email I can write about the design trade-offs considered (mainly 'writer waits' vs 'reader waits'), comparison with some other products, method of estimating replay lag, wait and timeout logic and how it maintains the guarantee in various failure scenarios, logic for standbys joining and leaving, implications of system clock skew between servers, or any other questions you may have, depending on feedback/interest (but see comments in the attached patch for some of those subjects). For now I didn't want to clog up the intertubes with too large a wall of text.
=== PROOF-OF-CONCEPT ===
Please see the POC patch attached. It adds two new GUCs. After setting up one or more hot standbys as per usual, simply add "causal_reads_timeout = 4s" to the primary's postgresql.conf and restart. Now, you can set "causal_reads = on" in some/all sessions to get guaranteed causal consistency. Expected behaviour: the causal reads guarantee is maintained at all times, even when you overwhelm, kill, crash, disconnect, restart, pause, add and remove standbys, and the primary drops them from the set it waits for in a timely fashion. You can monitor the system with the replay_lag and causal_reads_status columns in pg_stat_replication and some state transition LOG messages on the primary. (The patch also supports "synchronous_commit = apply", but it's not clear how useful that is in practice, as already discussed.)
Lastly, a few notes about how this feature relates to some other work:
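To restate the setup compactly (the 4s value is just the example from above; the SELECT uses the new columns mentioned above):

  # primary's postgresql.conf (restart required)
  causal_reads_timeout = 4s

  -- in any session that needs the guarantee (or set it globally)
  SET causal_reads = on;

  -- monitoring on the primary
  SELECT application_name, replay_lag, causal_reads_status FROM pg_stat_replication;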
The current version of this patch has causal_reads as a feature separate from synchronous_commit, from a user's point of view. The thinking behind this is that load balancing and data loss avoidance are separate concerns: synchronous_commit deals with the latter, and causal_reads with the former. That said, existing SyncRep machinery is obviously used (specifically SyncRep queues, with a small modification, as a way to wait for apply messages to arrive from standbys). (An earlier prototype had causal reads as a new level for synchronous_commit and associated states as new walsender states above 'streaming'. When contemplating how to combine this proposal with the multiple-synchronous-standby patch, some colleagues and I came around to the view that the concerns are separate. The reason for wanting to configure complicated quorum definitions is to control data loss risks and has nothing to do with load balancing requirements, so we thought the features should probably be separate.)
The multiple-synchronous-servers patch[3] could be applied (or not) independently of this feature as a result of that separation, since causal reads doesn't use synchronous_standby_names or indeed any kind of statically defined quorum.
The standby WAL writer patch[4] would significantly improve walreceiver performance and smoothness, which would work very well with this proposal.
Please let me know what you think!
Thanks,
"Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model."
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Simon Riggs
On 11 November 2015 at 05:37, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Many sites use hot standby servers to spread read-heavy workloads over more hardware, or at least would like to. This works well today if your application can tolerate some time lag on standbys. The problem is that there is no guarantee of when a particular commit will become visible for clients connected to standbys. The existing synchronous commit feature is no help here because it guarantees only that the WAL has been flushed on another server before commit returns. It says nothing about whether it has been applied or whether it has been applied on the standby that you happen to be talking to.
Thanks for working on this issue.
3. Commit on the primary with "causal_reads = on" waits for all 'available' standbys either to apply the commit record, or to cease to be 'available' and begin raising the error if they are still alive (because their authorizations have expired).
This causes every writer to wait.
What we want is to isolate the wait only to people performing a write-read sequence, so I think it should be readers that wait. Let's have that debate up front before we start reviewing the patch.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Heikki Linnakangas
On 11/11/2015 10:23 AM, Simon Riggs wrote:
> On 11 November 2015 at 05:37, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Many sites use hot standby servers to spread read-heavy workloads over more hardware, or at least would like to. This works well today if your application can tolerate some time lag on standbys. The problem is that there is no guarantee of when a particular commit will become visible for clients connected to standbys. The existing synchronous commit feature is no help here because it guarantees only that the WAL has been flushed on another server before commit returns. It says nothing about whether it has been applied or whether it has been applied on the standby that you happen to be talking to.
>
> Thanks for working on this issue.

+1.

>> 3. Commit on the primary with "causal_reads = on" waits for all 'available' standbys either to apply the commit record, or to cease to be 'available' and begin raising the error if they are still alive (because their authorizations have expired).
>
> This causes every writer to wait.
>
> What we want is to isolate the wait only to people performing a write-read sequence, so I think it should be readers that wait. Let's have that debate up front before we start reviewing the patch.

Agreed. And in the write-read sequence, you don't need to wait at the write either, it's enough that you wait just before you start doing the read. An application might do a lot of other things between the two, so that in most cases, there would in fact be no waiting as the record is already applied when you perform the read.

I'm thinking the client should get some kind of a token back from the commit, and it could use the token on the standby, to wait for that commit to be applied. The token could be just the XID, or the LSN of the commit record. Or the application could generate the token and pass it to the server in the commit, similar to how 2PC works. So the interaction would be something like:

In master:
BEGIN;
INSERT INTO FOO ...;
COMMIT;
Server returns: COMMITted with token 1234

Later, in standby:
BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
SELECT * FROM foo;
...

- Heikki
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Atri Sharma
> I'm thinking the client should get some kind of a token back from the commit, and it could use the token on the standby, to wait for that commit to be applied. The token could be just the XID, or the LSN of the commit record. Or the application could generate the token and pass it to the server in the commit, similar to how 2PC works. So the interaction would be something like:
>
> In master:
> BEGIN;
> INSERT INTO FOO ...;
> COMMIT;
> Server returns: COMMITted with token 1234
>
> Later, in standby:
> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> SELECT * FROM foo;

+1.

The LSN should be good enough IMO.
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Thomas Munro
On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I thought about this question, and considered three different approaches:
1. Reader waits with exposed LSNs, as Heikki suggests. This is what BerkeleyDB does in "read-your-writes" mode. It means that application developers have the responsibility for correctly identifying transactions with causal dependencies and dealing with LSNs (or whatever equivalent tokens), potentially even passing them to other processes where the transactions are causally dependent but run by multiple communicating clients (for example, communicating microservices). This makes it difficult to retrofit load balancing to pre-existing applications and (like anything involving concurrency) difficult to reason about as applications grow in size and complexity. It is efficient if done correctly, but it is a tax on application complexity.
2. Reader waits for a conservatively chosen LSN. This is roughly what MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1" modes. Read transactions would start off by finding the current end of WAL on the primary, since that must be later than any commit that already completed, and then waiting for that to apply locally. That means every read transaction waits for a complete replication lag period, potentially unnecessarily. This is a tax on readers with unnecessary waiting.
3. Writer waits, as proposed. In this model, there is no tax on readers (they have zero overhead, aside from the added complexity of dealing with the possibility of transactions being rejected when a standby falls behind and is dropped from 'available' status; but database clients must already deal with certain types of rare rejected queries/failures such as deadlocks, serialization failures, server restarts etc). This is a tax on writers.
My thinking was that the reason for wanting to load balance over a set of hot standbys is because you have a very read-heavy workload, so it makes sense to tax the writers and leave the many dominant readers unburdened, so (3) should be better than (2) for the majority of users who want such a configuration. (Note also that it's not a requirement to tax every write; with this proposal you can set causal_reads to off for those transactions where you know there is no possibility of a causally dependent read).
As for (1), my thinking was that most application developers would probably prefer not to have to deal with that type of interface. For users who do want to do that, it would be comparatively simple to make that possible, and would not conflict with this proposal. This proposal could be used by people retrofitting load balancing to existing applications with relative ease, or simply not wanting to have to deal with LSNs and complexity. (I have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout) separately, which could be called on a standby with the lsn obtained from pg_current_xlog_location() on the primary any time after a COMMIT completes, but I was thinking of that as a different feature addressing a different user base: people prepared to do more work to squeeze out some extra performance.)
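For what it's worth, a sketch of what that manual dance might look like with existing functions (pg_wait_for_xlog_replay_location() is the hypothetical function mentioned above and does not exist today, so the standby side is shown as a client-driven poll; the LSN literal is just an example value):

  -- on the primary, any time after the interesting COMMIT has returned:
  SELECT pg_current_xlog_location();   -- e.g. '0/3000060', used as the causality token

  -- on the standby, before running the causally dependent read, poll until true:
  SELECT pg_last_xlog_replay_location() >= '0/3000060'::pg_lsn AS caught_up;

  -- with the hypothetical function the poll would collapse to a single call, e.g.:
  -- SELECT pg_wait_for_xlog_replay_location('0/3000060', <timeout>);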
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Ants Aasma

On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>>
>> +1.

+1. I have seen a lot of interest for something along these lines.

>> I'm thinking the client should get some kind of a token back from the commit, and it could use the token on the standby, to wait for that commit to be applied. The token could be just the XID, or the LSN of the commit record. Or the application could generate the token and pass it to the server in the commit, similar to how 2PC works. So the interaction would be something like:
>>
>> In master:
>> BEGIN;
>> INSERT INTO FOO ...;
>> COMMIT;
>> Server returns: COMMITted with token 1234
>>
>> Later, in standby:
>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>> SELECT * FROM foo;
>> ...

To avoid read anomalies (backwards timetravel) it should also be possible to receive a token from read-only transactions based on the latest snapshot used.

> My thinking was that the reason for wanting to load balance over a set of hot standbys is because you have a very read-heavy workload, so it makes sense to tax the writers and leave the many dominant readers unburdened, so (3) should be better than (2) for the majority of users who want such a configuration. (Note also that it's not a requirement to tax every write; with this proposal you can set causal_reads to off for those transactions where you know there is no possibility of a causally dependent read).
>
> As for (1), my thinking was that most application developers would probably prefer not to have to deal with that type of interface. For users who do want to do that, it would be comparatively simple to make that possible, and would not conflict with this proposal. This proposal could be used by people retrofitting load balancing to existing applications with relative ease, or simply not wanting to have to deal with LSNs and complexity. (I have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout) separately, which could be called on a standby with the lsn obtained from pg_current_xlog_location() on the primary any time after a COMMIT completes, but I was thinking of that as a different feature addressing a different user base: people prepared to do more work to squeeze out some extra performance.)

Although I still think that 1) is the correct long term solution I must say that I agree with the reasoning presented. I think we should review the API in the light that in the future we might have a mix of clients, some clients that are able to keep track of causality tokens and either want to wait when a read request arrives, or pick a host to use based on the token, and then there are "dumb" clients that want to use write side waits.

Also, it should be possible to configure which standbys are considered for waiting on. Otherwise a reporting slave will occasionally catch up enough to be considered "available" and then cause a latency peak when a long query blocks apply again.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
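(For what it's worth, a client can approximate that read-side token manually today with existing functions; nothing below is provided by the patch, and the LSN literal is only an example:

  -- on the standby that served the read: remember how far replay had progressed
  SELECT pg_last_xlog_replay_location();   -- e.g. '0/3000060'

  -- before reading from a different standby, wait until it is at least that far,
  -- so the session never appears to travel backwards in time:
  SELECT pg_last_xlog_replay_location() >= '0/3000060'::pg_lsn AS caught_up;)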
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Robert Haas
On Wed, Nov 11, 2015 at 3:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> This causes every writer to wait.
>
> What we want is to isolate the wait only to people performing a write-read sequence, so I think it should be readers that wait. Let's have that debate up front before we start reviewing the patch.

One advantage of having writers wait is that the master and its read slaves can't ever get too far apart. Suppose the master is generating WAL much faster than the read slaves (or one of them) can replay it. You might say it sucks to slow down the master to the speed the slaves can keep up with, and that's true. On the other hand, if the master is allowed to run ahead, then a process that sends a read query to a standby which has gotten far behind might have to wait minutes or hours for it to catch up. I think a lot of people are enabling synchronous replication today just for the purpose of avoiding this problem - keeping the two machines "together in time" makes the overall system behavior a lot more predictable.

Also, if we made readers wait, wouldn't that require a network roundtrip to the master every time a query on a reader wanted a new snapshot? That seems like it would be unbearably expensive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Peter Eisentraut
On 11/11/15 4:22 AM, Thomas Munro wrote:
> My thinking was that the reason for wanting to load balance over a set of hot standbys is because you have a very read-heavy workload, so it makes sense to tax the writers and leave the many dominant readers unburdened, so (3) should be better than (2) for the majority of users who want such a configuration.

One problem I can see is that even if you have a read-heavy workload, the writes can still be a bottleneck, since they are necessarily bound to one node. And so if the feature proposal is, we can make your reads more consistent but the writes will become slower, then that's not a good deal.

More generally, no matter whether you pick the writers or the readers to wait, if you assume that read-only slaves are an application performance feature, then it's questionable how much better such applications will perform overall when network-bound waits are introduced in the system.

I think in practice applications that are busy enough to worry about this don't really work like that anyway. For example, the writes should go to a message queue and are written out whenever, with a copy kept in a cache for display in the meantime. Maybe there could be additional features to make managing this easier. I think there are a lot of different variations of this in practice, not only depending on the workload and other measurables, but also business-dependent decisions on application behavior and degradability.
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Thomas Munro
On Thu, Nov 12, 2015 at 12:10 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>>
>> +1.
+1. I have seen a lot of interest for something along these lines.
>> I'm thinking the client should get some kind of a token back from the
>> commit, and it could use the token on the standby, to wait for that commit
>> to be applied. The token could be just the XID, or the LSN of the commit
>> record. Or the application could generate the token and pass it to the
>> server in the commit, similar to how 2PC works. So the interaction would be
>> something like:
>>
>> In master:
>> BEGIN;
>> INSERT INTO FOO ...;
>> COMMIT;
>> Server returns: COMMITted with token 1234
>>
>> Later, in standby:
>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>> SELECT * FROM foo;
>> ...
To avoid read anomalies (backwards timetravel) it should also be
possible to receive a token from read-only transactions based on the
latest snapshot used.
> My thinking was that the reason for wanting to load balance over a set of
> hot standbys is because you have a very read-heavy workload, so it makes
> sense to tax the writers and leave the many dominant readers unburdened, so
> (3) should be better than (2) for the majority of users who want such a
> configuration. (Note also that it's not a requirement to tax every write;
> with this proposal you can set causal_reads to off for those transactions
> where you know there is no possibility of a causally dependent read).
>
> As for (1), my thinking was that most application developers would probably
> prefer not to have to deal with that type of interface. For users who do
> want to do that, it would be comparatively simple to make that possible, and
> would not conflict with this proposal. This proposal could be used by
> people retrofitting load balancing to an existing applications with relative
> ease, or simply not wanting to have to deal with LSNs and complexity. (I
> have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
> separately, which could be called on a standby with the lsn obtained from
> pg_current_xlog_location() on the primary any time after a COMMIT completes,
> but I was thinking of that as a different feature addressing a different
> user base: people prepared to do more work to squeeze out some extra
> performance.)
Although I still think that 1) is the correct long term solution I
must say that I agree with the reasoning presented. I think we should
review the API in the light that in the future we might have a mix of
clients, some clients that are able to keep track of causality tokens
and either want to wait when a read request arrives, or pick a host to
use based on the token, and then there are "dumb" clients that want to
use write side waits.
Exactly!
I see the causality tokens approach (thank you for that terminology) not so much as a "long term" solution, but rather as an expert feature likely to interest a small number of sophisticated users willing to take on more responsibility in exchange for greater control. We should definitely add support for that, and I expect the patch would be fairly simple and short.
But I believe the vast majority of users would like to be able to run new and existing plain SQL on any node and see the data they just wrote, with graceful failure modes, and without extra conceptual load or invasive code changes. So I think we should cater for that mode of usage too.
Also, it should be possible to configure which standbys are considered
for waiting on. Otherwise a reporting slave will occasionally catch up
enough to be considered "available" and then cause a latency peak when
a long query blocks apply again.
Good point. Here's a new version which adds the GUC causal_reads_standby_names, defaulting to '*' (but as before, the feature is not activated until you set causal_reads_timeout). Now you can list standby names explicitly if you want a way to exclude certain standbys. Also, I noticed that cascaded standbys shouldn't be available for causal reads, so I added a check for that.
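For example, to wait only for two web standbys and never for a reporting standby, the primary might use something like this (the names are invented, the values are matched against each standby's application_name following the existing synchronous_standby_names convention, and as before the default of '*' includes every standby):

  # primary's postgresql.conf
  causal_reads_timeout = 4s
  causal_reads_standby_names = 'web1, web2'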
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Simon Riggs
On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
1. Reader waits with exposed LSNs, as Heikki suggests. This is what BerkeleyDB does in "read-your-writes" mode. It means that application developers have the responsibility for correctly identifying transactions with causal dependencies and dealing with LSNs (or whatever equivalent tokens), potentially even passing them to other processes where the transactions are causally dependent but run by multiple communicating clients (for example, communicating microservices). This makes it difficult to retrofit load balancing to pre-existing applications and (like anything involving concurrency) difficult to reason about as applications grow in size and complexity. It is efficient if done correctly, but it is a tax on application complexity.
Agreed. This works if you have a single transaction connected thru a pool that does statement-level load balancing, so it works in both session and transaction mode.
I was in favour of a scheme like this myself, earlier, but have more thoughts now.
We must also consider the need for serialization across sessions or transactions.
In transaction pooling mode, an application could get assigned a different session, so a token would be much harder to pass around.
2. Reader waits for a conservatively chosen LSN. This is roughly what MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1" modes. Read transactions would start off by finding the current end of WAL on the primary, since that must be later than any commit that already completed, and then waiting for that to apply locally. That means every read transaction waits for a complete replication lag period, potentially unnecessarily. This is tax on readers with unnecessary waiting.
This tries to make it easier for users by forcing all users to experience a causality delay. Given the whole purpose of multi-node load balancing is performance, referencing the master again simply defeats any performance gain, so you couldn't ever use it for all sessions. It could be a USERSET parameter, so could be turned off in most cases that didn't need it. But it's easier to use than (1).
Though this should be implemented in the pooler.
3. Writer waits, as proposed. In this model, there is no tax on readers (they have zero overhead, aside from the added complexity of dealing with the possibility of transactions being rejected when a standby falls behind and is dropped from 'available' status; but database clients must already deal with certain types of rare rejected queries/failures such as deadlocks, serialization failures, server restarts etc). This is a tax on writers.
This would seem to require that all readers must first check with the master as to which standbys are now considered available, so it looks like (2).
The alternative is that we simply send readers to any standby and allow the pool to work out separately whether the standby is still available, which mostly works, but it doesn't handle sporadic slow downs on particular standbys very well (if at all).
I think we need to look at whether this does actually give us anything, or whether we are missing the underlying Heisenberg reality.
More later.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Thomas Munro
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> 1. Reader waits with exposed LSNs, as Heikki suggests. This is what BerkeleyDB does in "read-your-writes" mode. It means that application developers have the responsibility for correctly identifying transactions with causal dependencies and dealing with LSNs (or whatever equivalent tokens), potentially even passing them to other processes where the transactions are causally dependent but run by multiple communicating clients (for example, communicating microservices). This makes it difficult to retrofit load balancing to pre-existing applications and (like anything involving concurrency) difficult to reason about as applications grow in size and complexity. It is efficient if done correctly, but it is a tax on application complexity.
>
> Agreed. This works if you have a single transaction connected thru a pool that does statement-level load balancing, so it works in both session and transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more thoughts now.
>
> We must also consider the need for serialization across sessions or transactions.
>
> In transaction pooling mode, an application could get assigned a different session, so a token would be much harder to pass around.
>
>> 2. Reader waits for a conservatively chosen LSN. This is roughly what MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1" modes. Read transactions would start off by finding the current end of WAL on the primary, since that must be later than any commit that already completed, and then waiting for that to apply locally. That means every read transaction waits for a complete replication lag period, potentially unnecessarily. This is tax on readers with unnecessary waiting.
>
> This tries to make it easier for users by forcing all users to experience a causality delay. Given the whole purpose of multi-node load balancing is performance, referencing the master again simply defeats any performance gain, so you couldn't ever use it for all sessions. It could be a USERSET parameter, so could be turned off in most cases that didn't need it. But its easier to use than (1).
>
> Though this should be implemented in the pooler.
>
>> 3. Writer waits, as proposed. In this model, there is no tax on readers (they have zero overhead, aside from the added complexity of dealing with the possibility of transactions being rejected when a standby falls behind and is dropped from 'available' status; but database clients must already deal with certain types of rare rejected queries/failures such as deadlocks, serialization failures, server restarts etc). This is a tax on writers.
>
> This would seem to require that all readers must first check with the master as to which standbys are now considered available, so it looks like (2).
No -- in (3), that is this proposal, standbys don't check with the primary when you run a transaction. Instead, the primary sends a constant stream of authorizations (in the form of keepalives sent every causal_reads_timeout / 2 in the current patch) to the standby, allowing it to consider itself available for a short time into the future (currently now + causal_reads_timeout - max_tolerable_clock_skew to be specific -- I can elaborate on that logic in a separate email). At the start of a transaction in causal reads mode (the first call to GetTransaction to be specific), the standby knows immediately without communicating with the primary whether it can proceed or must raise the error. In the happy case, the reader simply compares the most recently received authorization's expiry time with the system clock and proceeds. In the worst case, when contact is lost between primary and standby, the primary must stall causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall). Doing that makes sure that no causal reads commit can return (see CausalReadsCommitCanReturn) before the lost standby has definitely started raising the error for causal_reads queries (because its most recent authorization has expired), in case it is still alive and handling requests from clients.
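To spell that out with the terms used above, the relationship the patch relies on is roughly:

  authorization_expiry = primary_now + causal_reads_timeout - max_tolerable_clock_skew

  standby may begin a causal_reads transaction  <=>  standby_now < authorization_expiry

so as long as the clocks of primary and standby differ by no more than max_tolerable_clock_skew, the most recent authorization lapses on the standby (and it starts raising the error) no later than the primary finishes stalling and allows causal_reads commits to return.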
It is not at all like (2), which introduces a conservative wait at the start of every read transaction, slowing all readers down. In (3), readers don't wait, they run (or are rejected) as fast as possible, but instead the primary has to do extra things. Hence my categorization of (2) as a 'tax on readers', and of (3) as a 'tax on writers'. The idea is that a site with a high ratio of reads to writes would prefer zero-overhead reads.
The alternative is that we simply send readers to any standby and allow the pool to work out separately whether the standby is still available, which mostly works, but it doesn't handle sporadic slow downs on particular standbys very well (if at all).
This proposal does handle sporadic slowdowns on standbys: it drops them from the set of available standbys if they don't apply fast enough, all the while maintaining the guarantee. Though it occurs to me that it probably needs some kind of defence against too much flapping between available and unavailable (maybe some kind of back off on the 'joining' phase that standbys go through when they transition from unavailable to available in the current patch, which I realize I haven't described yet -- but I don't want to get bogged down in details, while we're talking about the 30,000 foot view).
I think we need to look at whether this does actually give us anything, or whether we are missing the underlying Heisenberg reality.
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Thomas Munro
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 11 November 2015 at 09:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> 1. Reader waits with exposed LSNs, as Heikki suggests. This is what BerkeleyDB does in "read-your-writes" mode. It means that application developers have the responsibility for correctly identifying transactions with causal dependencies and dealing with LSNs (or whatever equivalent tokens), potentially even passing them to other processes where the transactions are causally dependent but run by multiple communicating clients (for example, communicating microservices). This makes it difficult to retrofit load balancing to pre-existing applications and (like anything involving concurrency) difficult to reason about as applications grow in size and complexity. It is efficient if done correctly, but it is a tax on application complexity.
>
> Agreed. This works if you have a single transaction connected thru a pool that does statement-level load balancing, so it works in both session and transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more thoughts now.
>
> We must also consider the need for serialization across sessions or transactions.
>
> In transaction pooling mode, an application could get assigned a different session, so a token would be much harder to pass around.
Sorry for the double reply, I just wanted to add a couple more thoughts.
As discussed elsewhere in the thread, I think it makes absolute sense to offer some kind of support for causality tokens, but I don't see that on its own as enough for most users. (At the least, it would be good to have pg_wait_for_xlog_replay_location(lsn, timeout), but perhaps also explicit BEGIN syntax as suggested by Heikki, or a new field in the libpq protocol which can be attached to any statement, and likewise for the commit LSN of results.)
It's true that a pooling system/middleware could spy on your sessions and insert causality token handling imposing a global ordering of visibility for you, so that naive users don't have to deal with them. Whenever it sees a COMMIT result (assuming they are taught to return LSNs), it could update a highest-LSN-seen variable, and transparently insert a wait for that LSN into every transaction that it sees beginning. But then you would have to push all your queries through a single point that can see everything across all Postgres servers, and maintain this global high LSN.
In contrast, my writer-waits proposal makes different trade-offs to provide causal reads as a built-in feature without an external single point observer of all transactions.
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Simon Riggs
On 12 November 2015 at 18:25, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
I don't want to get bogged down in details, while we're talking about the 30,000 foot view).
Hmm, if that's where we're at, I'll summarize my thoughts.
All of this discussion presupposes we are distributing/load balancing queries so that reads and writes might occur on different nodes.
We need a good balancer. Any discussion of this that ignores the balancer component is only talking about half the solution. What we need to do is decide whether functionality should live in the balancer or the core.
Your option (1) is viable, but only in certain cases. We could add support for some token/wait mechanism but as you say, this would require application changes not pooler changes.
Your option (2) is wider but also worse in some ways. It can be implemented in a pooler.
Your option (3) doesn't excite me much. You've got a load of stuff that really should happen in a pooler. And at its core we have synchronous_commit = apply but with a timeout rather than a wait. So anyway, consider me nudged to finish my patch to provide capability for that by 1 Jan.
On a related note, any further things like "GUC causal_reads_standby_names" should be implemented by Node Registry as a named group of nodes. We can have as many arbitrary groups of nodes as we want. If that sounds strange look back at exactly why GUCs are called GUCs.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Robert Haas
On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Hmm, if that's where we're at, I'll summarize my thoughts.
>
> All of this discussion presupposes we are distributing/load balancing queries so that reads and writes might occur on different nodes.

Agreed. I think that's a pretty common pattern, though certainly not the only one.

> We need a good balancer. Any discussion of this that ignores the balancer component is only talking about half the solution. What we need to do is decide whether functionality should live in the balancer or the core.

I'm all in favor of having a load-balancer in core, but that seems completely unrelated to the patch at hand.

> Your option (1) is viable, but only in certain cases. We could add support for some token/wait mechanism but as you say, this would require application changes not pooler changes.

Agreed.

> Your option (2) is wider but also worse in some ways. It can be implemented in a pooler.
>
> Your option (3) doesn't excite me much. You've got a load of stuff that really should happen in a pooler. And at its core we have synchronous_commit = apply but with a timeout rather than a wait. So anyway, consider me nudged to finish my patch to provide capability for that by 1 Jan.

I don't see how either option (2) or option (3) could be implemented in a pooler. How would that work?

To be frank, it's starting to seem to me like you are just trying to block this patch so you can have time to develop your own version instead. I hope that's not the case, because it would be quite unfair. When Thomas originally posted the patch, you complained that "This causes every writer to wait. What we want is to isolate the wait only to people performing a write-read sequence, so I think it should be readers that wait. Let's have that debate up front before we start reviewing the patch." Now, you seem to be saying that's OK, because you want to post a patch to do exactly the same thing under the name synchronous_commit=apply, but you want it to be your own patch, leaving out the other stuff that Thomas has put into this one. That could be the right thing to do, but how about we discuss it a bit?

The timeout stuff that Thomas has added here is really useful. Without that, if a machine goes down, we wait forever. That's the right thing to do if we're replicating to make sure transactions can never be lost, but it's a bad idea if we're replicating for load balancing. In the load balancing case, you want to drop sync slaves quickly to ensure the cluster remains available, and you need them to know they are out of sync so that the load balancer doesn't get confused. That's exactly what is implemented here. If you have an idea for a simpler implementation, great, but I think we need something. I don't see how it's going to work to make it entirely the pooler's job to figure out whether the cluster is in sync - it needs a push from the core server. Here, that push is easy to find: if a particular replica starts returning the "i'm out of sync" error when you query it, then stop routing queries to that replica until the error clears (which the pooler can find out by polling it with some trivial query). That's a great deal more useful than synchronous_commit=apply without such a feature.

> On a related note, any further things like "GUC causal_reads_standby_names" should be implemented by Node Registry as a named group of nodes. We can have as many arbitrary groups of nodes as we want. If that sounds strange look back at exactly why GUCs are called GUCs.

I think a node registry is a good idea, and my impression from the session in Vienna is quite a few other hackers do, too. But I also don't think it's remotely reasonable to make that a precondition for accepting this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Thomas Munro
On Sun, Nov 15, 2015 at 11:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 12 November 2015 at 18:25, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> I don't want to get bogged down in details, while we're talking about the 30,000 foot view).
>
> Hmm, if that's where we're at, I'll summarize my thoughts.
>
> All of this discussion presupposes we are distributing/load balancing queries so that reads and writes might occur on different nodes.
>
> We need a good balancer. Any discussion of this that ignores the balancer component is only talking about half the solution. What we need to do is decide whether functionality should live in the balancer or the core.
>
> Your option (1) is viable, but only in certain cases. We could add support for some token/wait mechanism but as you say, this would require application changes not pooler changes.
>
> Your option (2) is wider but also worse in some ways. It can be implemented in a pooler.
>
> Your option (3) doesn't excite me much. You've got a load of stuff that really should happen in a pooler. And at its core we have synchronous_commit = apply but with a timeout rather than a wait. So anyway, consider me nudged to finish my patch to provide capability for that by 1 Jan.
Just to be clear, this patch doesn't use a "timeout rather than a wait". It always waits for the current set of available causal reads standbys to apply the commit. It's just that nodes get kicked out of that set pretty soon if they don't keep up, a bit like a RAID controller dropping a failing disk. And it does so using a protocol that ensures that the dropped standby starts raising the error, even if contact has been lost with it, so the causal reads guarantee is maintained at all times for all clients.
On a related note, any further things like "GUC causal_reads_standby_names" should be implemented by Node Registry as a named group of nodes. We can have as many arbitrary groups of nodes as we want. If that sounds strange look back at exactly why GUCs are called GUCs.
Agreed, the application_name whitelist stuff is clunky. I left it out of the first version I posted, not wanting the focus of this proposal to be side-tracked. But as Ants Aasma pointed out, some users might need something like that, so I posted a 2nd version that follows the established example, again not wanting to distract with anything new in that area. Of course that would eventually be replaced/improved as part of a future node topology management project.
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Simon Riggs
On 15 November 2015 at 14:50, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Hmm, if that's where we're at, I'll summarize my thoughts.
>>
>> All of this discussion presupposes we are distributing/load balancing queries so that reads and writes might occur on different nodes.
>
> Agreed. I think that's a pretty common pattern, though certainly not the only one.
It looks to me this functionality is only of use in a pooler. Please explain how else this would be used.
>> Your option (2) is wider but also worse in some ways. It can be implemented in a pooler.
>>
>> Your option (3) doesn't excite me much. You've got a load of stuff that really should happen in a pooler. And at its core we have synchronous_commit = apply but with a timeout rather than a wait.
>
> I don't see how either option (2) or option (3) could be implemented in a pooler. How would that work?
My starting thought was that (1) was the only way forwards. Through discussion, I now see that it's not the best solution for the general case.
The pooler knows which statements are reads and writes, it also knows about transaction boundaries, so it is possible for it to perform the waits for either (2) or (3). The pooler *needs* to know which nodes it can route queries to, so it looks to me that the pooler is the best place to put waits and track status of nodes, no matter when we wait. I don't see any benefit in having other nodes keep track of node status since that will just replicate work that *must* be performed in the pooler.
I would like to see a load balancing pooler in Postgres.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Craig Ringer
On 16 November 2015 at 18:44, Simon Riggs <simon@2ndquadrant.com> wrote:
The pooler knows which statements are reads and writes
I think that's an iffy assumption. It's one we tend to make because otherwise read/write pooling won't work, but in PostgreSQL there's really no way to know when calling a function.
What does
SELECT get_user_stats()
do? The pooler has _no_ _idea_ unless manually configured with knowledge about particular user defined functions.
In the absence of such knowledge it can:
- send the work to a replica and report the ERROR to the user if it fails due to an attempted write;
- send the work to a replica, capture an ERROR due to attempted write, and retry on the master;
- send everything it's not sure about to the master
If a pooler had insight into the catalogs and if we had readonly / readwrite attributes on functions, it could be smarter.
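As a sketch of the kind of heuristic a pooler can already apply from the catalogs (provolatile describes optimizer volatility rather than write-safety, so this is only a rough approximation and would still need manual overrides):

  -- functions a pooler might tentatively treat as read-only candidates for a replica
  SELECT p.oid::regprocedure AS function
  FROM pg_proc p
  WHERE p.provolatile IN ('i', 's');   -- immutable or stable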
I would like to see a load balancing pooler in Postgres.
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From: Simon Riggs
On 16 November 2015 at 11:01, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 16 November 2015 at 18:44, Simon Riggs <simon@2ndquadrant.com> wrote:
>> The pooler knows which statements are reads and writes
>
> I think that's an iffy assumption.
It's not an assumption, it's a requirement. If it can't do this in some manner then you can't use a load balancing pooler.
Randomly submitting things works as well, since it leads to a write error when you try to write data on a read only server, so you do then learn whether it is a read or a write. Once you know it's a write, you submit to master. But you still need to be careful of other effects, so that isn't recommended.
> It's one we tend to make because otherwise read/write pooling won't work, but in PostgreSQL there's really no way to know when calling a function.
>
> What does
>
> SELECT get_user_stats()
>
> do? The pooler has _no_ _idea_ unless manually configured with knowledge about particular user defined functions.
>
> In the absence of such knowledge it can:
>
> - send the work to a replica and report the ERROR to the user if it fails due to an attempted write;
> - send the work to a replica, capture an ERROR due to attempted write, and retry on the master;
> - send everything it's not sure about to the master
>
> If a pooler had insight into the catalogs and if we had readonly / readwrite attributes on functions, it could be smarter.
pgpool supports white/black function listing for exactly this reason.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Simon Riggs
Date:
On 15 November 2015 at 10:41, Simon Riggs <simon@2ndquadrant.com> wrote:
> So anyway, consider me nudged to finish my patch to provide capability for that by 1 Jan.
My earlier patch aimed to allow WALReceiver to wait on both a latch and a socket as well as allow WALWriter to be active, so that WALReceiver could reply more quickly and handle greater workload. As I explained previously when we discussed that in recent posts, it is necessary infrastructure to have anybody wait on anything higher than remote-fsync. I aim to complete that by 1 Jan.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Nov 17, 2015 at 12:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 November 2015 at 10:41, Simon Riggs <simon@2ndquadrant.com> wrote:
>> So anyway, consider me nudged to finish my patch to provide capability for that by 1 Jan.
>
> My earlier patch aimed to allow WALReceiver to wait on both a latch and a socket as well as allow WALWriter to be active, so that WALReceiver could reply more quickly and handle greater workload. As I explained previously when we discussed that in recent posts, it is necessary infrastructure to have anybody wait on anything higher than remote-fsync. I aim to complete that by 1 Jan.
Right, handing write/fsync work off to WALWriter in standbys makes a lot of sense for any kind of writer-waits system, so that WALReceiver doesn't spend time in long syscalls which wouldn't play nicely with signals (whether from 'kill' or SetLatch) and can deal with network IO with the lowest possible latency. I would like to help test/review that, if that could be useful.
The SIGUSR1 code in WalReceiverMain and WalRecvWakeup in this patch works well enough for proof-of-concept purposes until then.
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Mon, Nov 16, 2015 at 5:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 15 November 2015 at 14:50, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Hmm, if that's where we're at, I'll summarize my thoughts.
>>>
>>> All of this discussion presupposes we are distributing/load balancing queries so that reads and writes might occur on different nodes.
>>
>> Agreed. I think that's a pretty common pattern, though certainly not the only one.
>
> It looks to me this functionality is only of use in a pooler. Please explain how else this would be used.

I think you're right. I mean, you could have the pooling built into the application, but some place connections have to be being farmed out to different nodes, or there's no point to using this feature. Some people may not want to use this feature, but those who do are using some form of pooling at some level.

>>> Your option (2) is wider but also worse in some ways. It can be implemented in a pooler.
>>>
>>> Your option (3) doesn't excite me much. You've got a load of stuff that really should happen in a pooler. And at its core we have synchronous_commit = apply but with a timeout rather than a wait.
>>
>> I don't see how either option (2) or option (3) could be implemented in a pooler. How would that work?
>
> My starting thought was that (1) was the only way forwards. Through discussion, I now see that it's not the best solution for the general case.
>
> The pooler knows which statements are reads and writes, it also knows about transaction boundaries, so it is possible for it to perform the waits for either (2) or (3).

As Craig says, it may not: pgbouncer, for example, won't. pgpool will, except when it's wrong because some apparently read-only function is actually writing data. But even if the pooler does know, that isn't enough for it to perform the waits for (2) or (3) without some support from the server. If it wants to wait for a particular transaction to be applied on the standby, it needs to know how long to wait, and without some support from the server, it has no way of knowing. Now that could be done by doing (1) and then having the pooler perform the waits, but now every pooler has to be taught how to do that. pgpool needs to know, pgbouncer needs to know, every JDBC-based connection pooler needs to know. Uggh. Thomas's solution is nice because it works with any pooler.

The other point I'd make, which I think may be what you said before but is worth making very explicit, is that (1) supposes that we know which reads are dependent on which previous writes. In the real world, that's probably frequently untrue. If someone does SELECT sum(salary) FROM emp on the standby, there's no particular write to the emp table that they want to wait for: they want to wait for ALL such writes previously acknowledged as committed. Now, even when the dependent writes can be identified, it may be convenient to just wait for all of them instead of a particular subset that we know are related. But I bet there will be many cases where identification is impractical or impossible, and thus I suspect (1) won't be very useful.

I think (2) and (3) both have good prospects for being useful, but I suspect that the performance consequences of (3), which is what Thomas actually implemented, although possibly severe, are still likely to be only a fraction of the cost of (2). Having to potentially wait every time a standby takes a snapshot just sounds awful to me.

> I would like to see a load balancing pooler in Postgres.

Me, too, but I don't expect that to be possible in the near future, and I think this is awfully useful until it does.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/12/15 1:11 PM, Thomas Munro wrote:
> It's true that a pooling system/middleware could spy on your sessions and insert causality token handling imposing a global ordering of visibility for you, so that naive users don't have to deal with them. Whenever it sees a COMMIT result (assuming they are taught to return LSNs), it could update a highest-LSN-seen variable, and transparently insert a wait for that LSN into every transaction that it sees beginning. But then you would have to push all your queries through a single point that can see everything across all Postgres servers, and maintain this global high LSN.

I think that depends on what you're doing. Frequently you don't care about anyone else's writes, just your own. In that case, there's no need for a shared connection pooler, you just have to come back to the same one.

There's also a 4th option: until a commit has made it out to some number of slaves, re-direct all reads from a session back to the master. That might sound horrible for master performance, but in reality I think it'd normally be fine. Generally, you only care about this when you're going to read data that you've just written, which means the data's still in shared buffers.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
Hi,
Here is a new version of the patch with a few small improvements:
1. Adopted the term '[read] lease', replacing various hand-wavy language in the comments and code. That seems to be the established term for this approach[1].
2. Reduced the stalling time on failure. When things go wrong with a standby (such as losing contact with it), instead of stalling for a conservative amount of time longer than any lease that might have been granted, the primary now stalls only until the expiry of the last lease that actually was granted to a given dropped standby, which should be sooner.
3. Fixed a couple of bugs that showed up in testing and review (some bad flow control in the signal handling, and a bug in a circular buffer), and changed the recovery->walreceiver wakeup signal handling to block the signal except while waiting in walrcv_receive (it didn't seem a good idea to interrupt arbitrary syscalls in walreceiver so I thought that would be an improvement; but of course that area's going to be reworked by Simon's patch anyway, as discussed elsewhere).
Restating the central idea using the new terminology: So long as they are replaying fast enough, the primary grants a series of causal reads leases to standbys allowing them to handle causal reads queries locally without any inter-node communication for a limited time. Leases are promises that the primary will wait for the standby to apply commit records OR be dropped from the set of available causal reads standbys and know that it has been dropped, before the primary returns from commit, in order to uphold the causal reads guarantee. In the worst case it can do that by waiting for the most recently granted lease to expire.
I've also attached a couple of things which might be useful when trying the patch out: test-causal-reads.c which can be used to test performance and causality under various conditions, and test-causal-reads.sh which can be used to bring up a primary and a bunch of local hot standbys to talk to. (In the hope of encouraging people to take the patch for a spin...)
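If you do take it for a spin, the primary's view of each standby can be watched with something like this (column names as they stand in this version of the patch, so consider them provisional):

    SELECT application_name, state, replay_lag, causal_reads_status
    FROM pg_stat_replication;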
[1] Originally from a well known 1989 paper on caching, but in the context of databases and synchronous replication see for example the recent papers on "Niobe" and "Paxos Quorum Leases" (especially the reference to Google Megastore). Of course a *lot* more is going on in those very different algorithms, but at some level "read leases" are being used to allow local-node-only reads for a limited time while upholding some kind of global consistency guarantee, in some of those consensus database systems. I spent a bit of time talking about consistency levels to database guru and former colleague Alex Scotti who works on a Paxos-based system, and he gave me the initial idea to try out a lease-based consistency system for Postgres streaming rep. It seems like a very useful point in the space of trade-offs to me.
Thomas Munro
http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is a new version of the patch with a few small improvements:
> ...
> [causal-reads-v3.patch]

That didn't apply after 6e7b3359 (which fixed a typo in a comment that I moved). Here is a new version that does.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Joel Jacobson
Date:
+1 to both the feature and the concept of how it's implemented. Haven't looked at the code though.

This feature would be very useful for us at Trustly. It would mean we could get rid of an entire system component in our architecture (memcached) which we only use to write data which must be immediately readable on the sync slave after the master commits. The only such data we currently have is the backoffice sessionid, which must be readable on the slave, otherwise the read-only calls which we route to the slave might fail because it's missing.
--
Joel Jacobson
Mobile: +46703603801
Trustly.com | Newsroom | LinkedIn | Twitter
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Dec 30, 2015 at 5:15 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Here is a new version of the patch with a few small improvements:
>> ...
>> [causal-reads-v3.patch]
>
> That didn't apply after 6e7b3359 (which fixed a typo in a comment that I moved). Here is a new version that does.

That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so here is a new version.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Wed, Jan 20, 2016 at 1:12 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Dec 30, 2015 at 5:15 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> That didn't apply after 6e7b3359 (which fixed a typo in a comment that I moved). Here is a new version that does.
>
> That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so here is a new version.

You should try to register it in a CF, though it may be too late for 9.6 if that's rather intrusive.

--
Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Amit Langote
Date:
Hi Thomas,

On 2016/01/20 13:12, Thomas Munro wrote:
> That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so here is a new version.

-    if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
+    if (is_highest_priority_sync_standby)
[ ... ]
-    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
-    {
-        walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
-        numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
[ ... ]
+    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->write)
+    {
+        walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
+        numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH,
+                                    MyWalSnd->flush);

There seems to be a copy-pasto there - shouldn't that be:

+    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

Thanks,
Amit
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> There seems to be a copy-pasto there - shouldn't that be:
>
> +    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

Indeed, thanks! New patch attached.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On 3 February 2016 at 10:46, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> There seems to be a copy-pasto there - shouldn't that be:
>>
>> +    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
>
> Indeed, thanks! New patch attached.

I've given this a test drive, and it works exactly as described.

But one thing which confuses me is when a standby, with causal_reads enabled, has just finished starting up. I can't connect to it because:

FATAL: standby is not available for causal reads

However, this is the same message when I'm successfully connected, but it's lagging, and the primary is still waiting for the standby to catch up:

ERROR: standby is not available for causal reads

What is the difference here? The problem being reported appears to be identical, but in the first case I can't connect, but in the second case I can (although I still can't issue queries).

Thom
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown <thom@linux.com> wrote:
> I've given this a test drive, and it works exactly as described.

Thanks for trying it out!

> But one thing which confuses me is when a standby, with causal_reads enabled, has just finished starting up. I can't connect to it because:
>
> FATAL: standby is not available for causal reads
>
> However, this is the same message when I'm successfully connected, but it's lagging, and the primary is still waiting for the standby to catch up:
>
> ERROR: standby is not available for causal reads
>
> What is the difference here? The problem being reported appears to be identical, but in the first case I can't connect, but in the second case I can (although I still can't issue queries).

Right, you get the error at login before it has managed to connect to the primary, and for a short time after while it's in 'joining' state, or potentially longer if there is a backlog of WAL to apply.

The reason is that when causal_reads = on in postgresql.conf (as opposed to being set for an individual session or role), causal reads logic is used for snapshots taken during authentication (in fact the error is generated when trying to take a snapshot slightly before authentication proper begins, in InitPostgres). I think that's a desirable feature: if you have causal reads on and you create/alter a database/role (for example setting a new password) and commit, and then you immediately try to connect to that database/role on a standby where you have causal reads enabled system-wide, then you get the causal reads guarantee during authentication: you either see the effects of your earlier transaction or you see the error. As you have discovered, there is a small window after a standby comes up where it will show the error because it hasn't got a lease yet, so it can't let you log in yet because it could be seeing a stale catalog (your user may not exist on the standby yet, or have been altered in some way, or your database may not exist yet, etc).

Does that make sense?

--
Thomas Munro
http://www.enterprisedb.com
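PS: Concretely, the authentication-time case I have in mind is something like this (just a sketch, role name invented):

    -- on the primary, in a session with causal_reads = on:
    CREATE ROLE alice LOGIN PASSWORD 'secret';

    -- immediately afterwards, connecting as alice to any standby that has
    -- causal_reads enabled system-wide either sees the new role during
    -- authentication, or fails with
    --   FATAL:  standby is not available for causal reads
    -- it never claims the role doesn't exist merely because of replication lag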
On 21 February 2016 at 23:18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> As you have discovered, there is a small window after a standby comes up where it will show the error because it hasn't got a lease yet, so it can't let you log in yet because it could be seeing a stale catalog [...]
>
> Does that make sense?

Ah, all clear. Yes, that makes sense now. I've been trying to break it for the past few days, and this was the only thing which I wasn't clear on. The parameters all work as described.

The replay_lag is particularly cool. Didn't think it was possible to glean this information on the primary, but the timings are correct in my tests.

+1 for this patch. Looks like this solves the problem that semi-synchronous replication tries to solve, although arguably in a more sensible way.

Thom
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown <thom@linux.com> wrote:
> The replay_lag is particularly cool. Didn't think it was possible to glean this information on the primary, but the timings are correct in my tests.
>
> +1 for this patch. Looks like this solves the problem that semi-synchronous replication tries to solve, although arguably in a more sensible way.

Yeah, having extra logic at the application layer to check if a certain LSN position has been applied or not is doable, but if we can avoid it that's a clear plus.

This patch has no documentation. I will try to figure out by myself how the new parameters interact with the rest of the syncrep code while looking at it, but if we want to move on to get something committable for 9.6 it would be good to get some documentation soon.

--
Michael
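PS: By "extra logic at the application layer" I mean roughly this kind of dance, which the patch would make unnecessary (a sketch only; the LSN value is illustrative and the polling loop is left to the client):

    -- on the primary, immediately after COMMIT:
    SELECT pg_current_xlog_location();           -- e.g. returns '0/3000060'

    -- on the chosen standby, poll until the remembered LSN has been replayed:
    SELECT pg_last_xlog_replay_location() >= '0/3000060'::pg_lsn;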
On 27 February 2016 at 13:20, Michael Paquier <michael.paquier@gmail.com> wrote:
> Yeah, having extra logic at the application layer to check if a certain LSN position has been applied or not is doable, but if we can avoid it that's a clear plus.
>
> This patch has no documentation. I will try to figure out by myself how the new parameters interact with the rest of the syncrep code while looking at it, but if we want to move on to get something committable for 9.6 it would be good to get some documentation soon.

Could we rename "apply" to "remote_apply"? It seems more consistent with "remote_write", and matches its own enum entry too.

Thom
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Sun, Feb 28, 2016 at 2:20 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Yeah, having extra logic at the application layer to check if a certain LSN position has been applied or not is doable, but if we can avoid it that's a clear plus.
>
> This patch has no documentation. I will try to figure out by myself how the new parameters interact with the rest of the syncrep code while looking at it, but if we want to move on to get something committable for 9.6 it would be good to get some documentation soon.

Thanks for looking at the patch! Here is a new version with the following changes:

1. Some draft user documentation has been added, as requested.

2. The new synchronous commit level (separate from the causal reads feature, but implemented for completeness) is now called "remote_apply" instead of "apply", as suggested by Thom Brown.

3. There is a draft README.causal_reads file which provides some developer notes about state transitions, leases and clock skew.

4. The 'joining' state management has been improved (it's now based on xlog positions rather than time; see the README and comments for details).

5. The ps title is now restored after it is modified during causal reads-related waiting.

6. I assigned an errcode (40P02) for causal reads failures (useful for clients/middleware/libraries that might want to handle this error automatically), as suggested by a couple of people off-list.

--
Thomas Munro
http://www.enterprisedb.com
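PS: For anyone wanting to try the renamed level on its own, it is used like any other synchronous_commit value (a sketch; it assumes a standby is configured in synchronous_standby_names):

    SET synchronous_commit = remote_apply;
    -- COMMIT now returns only once the synchronous standby has applied the
    -- commit record, not merely flushed it to disk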
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Amit Langote
Date:
Hi Thomas,

On 2016/02/29 15:20, Thomas Munro wrote:
> Thanks for looking at the patch! Here is a new version with the following changes:
>
> 1. Some draft user documentation has been added, as requested.

Just to clarify, in:

+        servers.  A transaction that is run with <varname>causal_reads</> set
+        to <literal>on</> is guaranteed either to see the effects of all
+        completed transactions run on the primary with the setting on, or to
+        receive an error "standby is not available for causal reads".

"A transaction that is run" means "A transaction that is run on a standby", right?

By the way, is there some discussion in our existing documentation to refer to about causal consistency in the single node case? I don't know, maybe that will help ease into the new feature. Grepping the existing source tree doesn't reveal the term "causal", so maybe even a single line in the patch mentioning "single node operation trivially implies (or does it?) causal consistency" would help. Thoughts?

Thanks,
Amit
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Just to clarify, in:
>
> +        servers.  A transaction that is run with <varname>causal_reads</> set
> +        to <literal>on</> is guaranteed either to see the effects of all
> +        completed transactions run on the primary with the setting on, or to
> +        receive an error "standby is not available for causal reads".
>
> "A transaction that is run" means "A transaction that is run on a standby", right?

Well, it could be any server, standby or primary. Of course standbys are the interesting case, since it was already true that if you run two sequential transactions on the primary, the second can see the effects of the first, but I like the idea of a general rule that applies anywhere, allowing you not to care which server it is.

> By the way, is there some discussion in our existing documentation to refer to about causal consistency in the single node case? I don't know, maybe that will help ease into the new feature. Grepping the existing source tree doesn't reveal the term "causal", so maybe even a single line in the patch mentioning "single node operation trivially implies (or does it?) causal consistency" would help. Thoughts?

Hmm. Where should such a thing go? I probably haven't introduced the term well enough. I thought for a moment about putting something here:

http://www.postgresql.org/docs/devel/static/sql-commit.html

"All changes made by the transaction become visible to others ..." -- which others? But I backed out; that succinct account of COMMIT is 20 years old, and in any case visibility is tied to committing, not specifically to the COMMIT command. But perhaps this patch really should include something there that refers back to the causal reads section.

--
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Mon, Feb 29, 2016 at 6:05 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> "All changes made by the transaction become visible to others ..." -- which others? But I backed out; that succinct account of COMMIT is 20 years old, and in any case visibility is tied to committing, not specifically to the COMMIT command. But perhaps this patch really should include something there that refers back to the causal reads section.

Luckily enough, read uncommitted behaves like read committed in PG, making this true :)

--
Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Amit Langote
Date:
Hi,

On 2016/02/29 18:05, Thomas Munro wrote:
> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>> "A transaction that is run" means "A transaction that is run on a standby", right?
>
> Well, it could be any server, standby or primary. Of course standbys are the interesting case, since it was already true that if you run two sequential transactions on the primary, the second can see the effects of the first, but I like the idea of a general rule that applies anywhere, allowing you not to care which server it is.

I actually meant in the context of that sentence only.

>> By the way, is there some discussion in our existing documentation to refer to about causal consistency in the single node case? [...]
>
> Hmm. Where should such a thing go? I probably haven't introduced the term well enough. I thought for a moment about putting something here:
>
> http://www.postgresql.org/docs/devel/static/sql-commit.html
>
> "All changes made by the transaction become visible to others ..." -- which others? But I backed out; that succinct account of COMMIT is 20 years old, and in any case visibility is tied to committing, not specifically to the COMMIT command. But perhaps this patch really should include something there that refers back to the causal reads section.

I see. I agree this is not exactly material for the COMMIT page. Perhaps somewhere under "Chapter 13. Concurrency Control" with a cross-reference to/from "25.5. Hot Standby". Might be interesting to hear from others as well.

Thanks,
Amit
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I actually meant in the context of that sentence only.

Ok, here's a new version that includes that change, fixes a conflict with recent commit 10b48522 and removes an accidental duplicate copy of the README file.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Ok, here's a new version that includes that change, fixes a conflict with recent commit 10b48522 and removes an accidental duplicate copy of the README file.

Finally I got my eyes on this patch.

To put it short, this patch introduces two different concepts:

- addition of a new value, remote_apply, for synchronous_commit, which is actually where things overlap a bit with the N-sync patch, because by combining the addition of remote_apply with the N-sync patch, it is possible to ensure that an LSN is applied to multiple targets instead of one now. In this case, as the master will wait for the LSN to be applied on the sync standby, there is no need for the application to have error handling in case a read transaction is happening on the standby, as the change is already visible on the standby when committing it on master. However the cost here is that if the standby node lags behind, it puts some extra wait as well on the master side.

- causal reads, which makes the master able to drop standbys that are lagging too much behind and lets callers of standbys know that a standby is lagging too much behind and cannot satisfy causal reads. In this case error handling is needed by the application in case a standby is lagging too much behind.

That's one of my concerns about this patch now: it is trying to do too much. I think that there is definitely a need for both things: applications may be fine to pay the lagging price when remote_apply is used by not having an extra error handling layer, or they cannot accept a perhaps large amount of lag and are ready to have an extra error handling layer to manage read failures on standbys. So I would recommend a split to begin with:

1) Addition of remote_apply in synchronous_commit, which would be quite a simple patch, and I am convinced that there are benefits in having that. Compared to the previous patch sent, a standby message is sent back to the master once COMMIT has been applied, accelerating things a bit.

2) Patch for causal reads, with all its extra parametrization logic and stuff to select standby candidates.

Here is as well a more detailed review...

Regression tests are failing, rules.sql is generating diffs because pg_stat_replication is changed.

CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also call XLogRequestWalReceiverReply() when needed.

The new parameters are missing from postgresql.conf.sample.

+static bool
+SyncRepCheckEarlyExit(void)
+{

Doing this refactoring would actually make sense as a separate patch I think, and that would simplify the core patch for causal reads.

+For this reason we say that the causal reads guarantee only holds as
+long as the absolute difference between the system clocks of the
+machines is no more than max_clock_skew.  The theory is that NTP makes
+it possible to reason about the maximum possible clock difference
+between machines and choose a value that allows for a much larger
+difference.  However, we do make a best effort attempt to detect
+misconfigured systems as described above, to catch the case of servers
+not running a correctly configured ntp daemon, or with a clock so
+far out of whack that ntp refuses to fix it.

Just wondering how this reacts when standby and master are on different timezones. I know of two ways to measure WAL lag: one is what you are doing, by using a timestamp and relying on the two servers to be in sync at clock level. The second is in bytes with a WAL quantity, though it is less user-friendly to set up, say max_wal_lag or similar, symbolized by the number of WAL segments the standby is lagging behind; the concerns regarding clock sync across nodes go away. To put it short, I find the timestamp approach easy to set up and understand for the user, but not really solid as it depends much on the state dependency between different hosts, while a difference between flush and apply LSN positions is a quite independent concept. So basically what would be used as a base comparison is not the timestamp of the transaction commit but the flush LSN at the moment commit has been replayed.

+   /*
+    * If a causal_reads_timeout is configured, it is used instead of
+    * wal_sender_timeout.  Ideally we'd use causal_reads_timeout / 2 +
+    * allowance for network latency, but since walreceiver can become quite
+    * bogged down fsyncing WAL we allow more tolerance.  (This could be
+    * tightened up once standbys hand writing off to the WAL writer).
+    */
+   if (causal_reads_timeout != 0)
+       allowed_time = causal_reads_timeout;
+   else
+       allowed_time = wal_sender_timeout;

I find that surprising, for two reasons:
1) it seems to me that causal_reads_timeout has in concept no relation with WAL sender process control.
2) A standby should still be able to receive WAL even if it cannot satisfy causal reads, to give it a chance to catch up faster the amount it is late.

-   SYNCHRONOUS_COMMIT_REMOTE_FLUSH     /* wait for local and remote flush */
+   SYNCHRONOUS_COMMIT_REMOTE_FLUSH,    /* wait for local and remote flush */
+   SYNCHRONOUS_COMMIT_REMOTE_APPLY,    /* wait for local flush and remote
+                                        * apply */
+   SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
+                                          apply with causal consistency */

SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a typo s/flusha/flush a/.

I am still digging into the patch, the available/joining/unavailable logic being quite interesting.

--
Michael
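PS: The two-phase sequence I have in mind is something like this (a sketch only; it assumes max_prepared_transactions > 0, and the table is invented):

    CREATE TABLE payments (id int PRIMARY KEY, amount numeric);

    BEGIN;
    INSERT INTO payments VALUES (1, 100.00);
    PREPARE TRANSACTION 'tx1';

    -- possibly much later, possibly from another session:
    COMMIT PREPARED 'tx1';    -- this is when the changes become visible to
                              -- readers, so the causal reads wait has to
                              -- cover it too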
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Thu, Mar 3, 2016 at 3:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> That's one of my concerns about this patch now: it is trying to do too much. [...] So I would recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be quite a simple patch, and I am convinced that there are benefits in having that. Compared to the previous patch sent, a standby message is sent back to the master once COMMIT has been applied, accelerating things a bit.

Hm. Looking now at http://www.postgresql.org/message-id/CANP8+j+jCpNoOjc-KQLtt4PDyOX2Sq6wYWqCSy6aaHWkvNa0hw@mail.gmail.com it would be nice to get a clear solution for it first, though the use of signals to wake up the WAL receiver and enforce it to send a new LSN apply position back to the master to unlock it asap does not look very appealing. Seeing that no patch has been sent for 9.6 regarding that, it would be better to simply drop this code from the causal-read patch perhaps...

--
Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> That's one of my concerns about this patch now: it is trying to do too much. [...] So I would recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit [...]
> 2) Patch for causal reads, with all its extra parametrization logic and stuff to select standby candidates.

Thanks. Yes, this makes a lot of sense. I have done some work on splitting this up and will post the result soon, along with my responses to your other feedback.
--
Thomas Munro
http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Finally I got my eyes on this patch.

Thanks, I really appreciate your time and interest. Your feedback gave me plenty to chew on. See below for replies to both your recent emails.

> That's one of my concerns about this patch now: it is trying to do too much. [...] So I would recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit [...]
> 2) Patch for causal reads, with all its extra parametrization logic and stuff to select standby candidates.

Agreed. I have split this work up into four patches which apply on top of each other, and provide something (hopefully) useful at each stage. Namely:

0001-remote-apply.patch: This adds synchronous_commit = remote_apply. It works by setting a new bit in commit records to say that the primary requests feedback when the record is applied. When the recovery process sees this bit, it wakes the walreceiver process and asks it to send a reply to the primary ASAP.

0002-replay-lag.patch: This adds a time-based estimate of the current apply lag on each standby. The value is visible in pg_stat_replication. It uses a combination of commit timestamps and periodic LSN->time samples from keepalive messages, so that replay lag is computed even when there are no commits, for example during bulk data loads. This builds on the previous patch, because it makes use of the mechanism whereby the recovery process can ask the walreceiver to send replies whenever it applies WAL records for which it has timestamps.

0003-refactor-syncrep-exit.patch: This moves the code to handle syncrep wait cancellation into a function, as you suggested. Conceptually independent of the previous two patches, but presented as a patch on top of the previous patches and separately from causal-reads.patch for ease of review.

0004-causal-reads.patch: This adds the causal reads feature, building on top of the other three patches. From the remote-apply patch it uses the mechanism for asking for apply feedback. From the replay-lag patch it uses the mechanism for tracking apply lag. It also uses the syncrep wait cancellation code.

> Here is as well a more detailed review...
>
> Regression tests are failing, rules.sql is generating diffs because pg_stat_replication is changed.

Fixed, thanks.

> CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE, COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also call XLogRequestWalReceiverReply() when needed.

Ah yes, agreed for COMMIT PREPARED. Fixed. But why for PREPARE TRANSACTION or ROLLBACK PREPARED? Those don't generate visible effects that others might expect to see on a standby, so I don't think that is necessary.

> The new parameters are missing from postgresql.conf.sample.

Fixed, thanks.

> +static bool
> +SyncRepCheckEarlyExit(void)
> +{
>
> Doing this refactoring would actually make sense as a separate patch I think, and that would simplify the core patch for causal reads.

Agreed -- see 0003-refactor-syncrep-exit.patch.

> +For this reason we say that the causal reads guarantee only holds as
> +long as the absolute difference between the system clocks of the
> +machines is no more than max_clock_skew. [...]
>
> Just wondering how this reacts when standby and master are on different timezones. I know of two ways to measure WAL lag: one is what you are doing, by using a timestamp and relying on the two servers to be in sync at clock level. The second is in bytes with a WAL quantity [...] So basically what would be used as a base comparison is not the timestamp of the transaction commit but the flush LSN at the moment commit has been replayed.

I'm not actually using either of those approaches. Postgres already periodically captures (time, WAL end) pairs on the primary, and feeds them to the standby. With this patch set, the standby records these samples in a circular buffer, and when the recovery process eventually applies sampled WAL positions, it feeds the associated timestamps back to the primary in unsolicited keepalive reply messages. Then the primary can judge how long it took the standby to apply the sampled WAL position. It doesn't suffer from clock skew problems because it's comparing two timestamps obtained from the system clock on the same box. On the other hand, the time reported includes an extra network latency term, as the reply message has to reach the primary. That might be a good thing, depending on your purpose, as it reflects the time you'd be waiting for syncrep/causal reads on that standby, which includes such latency.

I had to implement this because, in an early prototype of the causal reads feature, the system would choke on large data load transactions even though the standbys were able to keep up. Lag could only be measured when commit records came along. That meant that a very big data load was indistinguishable from a system not applying fast enough, and the standby would drop to causal reads unavailable state spuriously. I considered interpolating WAL positions in between commit records, but couldn't see how to make it work. I considered injecting extra timestamps into the WAL periodically, but I didn't need anything persistent. I just needed a way to collect occasional (time, LSN) samples in memory and feed the times back at the appropriate moments as the WAL is replayed.

> +   /*
> +    * If a causal_reads_timeout is configured, it is used instead of
> +    * wal_sender_timeout. [...]
> +    */
> +   if (causal_reads_timeout != 0)
> +       allowed_time = causal_reads_timeout;
> +   else
> +       allowed_time = wal_sender_timeout;
>
> I find that surprising, for two reasons:
> 1) it seems to me that causal_reads_timeout has in concept no relation with WAL sender process control.
> 2) A standby should still be able to receive WAL even if it cannot satisfy causal reads, to give it a chance to catch up faster the amount it is late.

This is only used to detect dead/disconnected/unreachable standbys, not to handle standbys that are merely not keeping up with replication traffic. To get disconnected this way, a standby has to be not responding to keepalive pings. When dealing with a causal reads standby, we want to cap the time we're prepared to hang around waiting for a zombie standby before nuking it. If we didn't have that change, then we would stop waiting for slow standbys (not applying fast enough) in a timely fashion, but crashed/disconnected standbys would make our commits hang for the potentially much longer wal_sender_timeout, which seems silly. Zombie servers shouldn't be able to interfere with your commits for longer periods of time than lagging servers.

> SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a typo s/flusha/flush a/.
>
> I am still digging into the patch, the available/joining/unavailable logic being quite interesting.

I hope the new patch set makes that job easier...

On Thu, Mar 3, 2016 at 8:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Hm. Looking now at http://www.postgresql.org/message-id/CANP8+j+jCpNoOjc-KQLtt4PDyOX2Sq6wYWqCSy6aaHWkvNa0hw@mail.gmail.com it would be nice to get a clear solution for it first, though the use of signals to wake up the WAL receiver and enforce it to send a new LSN apply position back to the master to unlock it asap does not look very appealing.
Seeing that no patch has been sent for 9.6 regarding > that, it would be better to simply drop this code from the causal-read > patch perhaps... The signalling mechanism is still present. Since that discussion I had the idea of using a different (non-overloaded) signal and unblocking it only while actually waiting for the network, so there should be no signal delivery interrupting blocking disk IO operations. But I certainly agree that the disk IO should be moved out of the walreceiver process and in to the WAL writer process, for several reasons. This patch series doesn't need that to work though. Here's a recap of how the signal is used, in the context of the communication introduced by each patch: 0001-remote-apply.patch * the primary sets a new bit XACT_COMPLETION_SYNC_APPLY_FEEDBACK in commit records' xinfo when synchronous_commit = remote_apply * the WAL record is transported walsender->walreceiver as usual * the recovery process eventually applies the commit record, and, when it sees that bit, it wakes the walreceiver with SIGUSR2 * when walreceiver receives the signal it generates a keepalive reply to report the latest apply position * the walsender uses that information to release syncrep waiters (just as it always did) 0002-replay-lag.patch * the walreceiver records (time, lsn) pairs into a circular buffer as they arrive * the recovery process, when it discovers that it has replayed an LSN associated with a time in the circular buffer, uses the same signal as used in 0001-remote-apply.patch to ask walreceiver to report this progress * the walreceiver includes the last replay timestamp in a new field added to the reply message 0004-causal-reads.patch * reuses the above to implement causal_reads mode: XACT_COMPLETION_SYNC_APPLY_FEEDBACK is now also set if causal_reads is on, because it too needs to know when commit records are applied * the rest of the patch implements the new state machine, lease control and waiting machinery Some alternatives to that signal could include: a pipe between recovery and walreceiver that could be multiplexed with network IO (which may be tricky to implement cross-platform, I don't know, but I don't see anything similar in Postgres), or a Postgres latch that could be multiplexed with network IO using some future primitive as suggested elsewhere (but would ultimately still involve a signal). There would be plenty of scope for evolution, for example when integrating with the standby WAL writer work, but the signal approach seemed simple and effective for now and certainly isn't the first case of backends signalling each other. Thanks, -- Thomas Munro http://www.enterprisedb.com
Attachment
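As a rough standalone sketch of the sampling mechanism described in that last message (every structure and function name below is invented for illustration; this is not the patch's code), the standby-side bookkeeping amounts to a small ring of (primary timestamp, WAL position) pairs that recovery drains as it applies WAL:

/*
 * Standalone sketch, not the patch's code: the standby buffers
 * (primary timestamp, WAL end) pairs taken from keepalives; when
 * recovery applies past a sampled position, the associated primary
 * timestamp is handed back so the primary can compute lag against its
 * own clock.  Both timestamps come from the same host, so clock skew
 * between servers does not enter into it.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LAG_BUFFER_SIZE 8192

typedef struct LagSample
{
    int64_t  primary_time_us;   /* primary's clock when WAL was generated */
    uint64_t lsn;               /* WAL end position at that moment */
} LagSample;

static LagSample lag_buffer[LAG_BUFFER_SIZE];
static int lag_read = 0;        /* oldest unconsumed sample */
static int lag_write = 0;       /* next free slot */

/* Called by the (modelled) walreceiver when a keepalive arrives. */
static void
lag_buffer_record(int64_t primary_time_us, uint64_t lsn)
{
    int next = (lag_write + 1) % LAG_BUFFER_SIZE;

    if (next == lag_read)
        return;                 /* buffer full: drop the sample */
    lag_buffer[lag_write].primary_time_us = primary_time_us;
    lag_buffer[lag_write].lsn = lsn;
    lag_write = next;
}

/*
 * Called by the (modelled) recovery process after applying up to
 * applied_lsn.  Returns true and sets *report_time_us if at least one
 * sampled position has now been applied, so a reply should be sent.
 */
static bool
lag_buffer_consume(uint64_t applied_lsn, int64_t *report_time_us)
{
    bool found = false;

    while (lag_read != lag_write && lag_buffer[lag_read].lsn <= applied_lsn)
    {
        *report_time_us = lag_buffer[lag_read].primary_time_us;
        lag_read = (lag_read + 1) % LAG_BUFFER_SIZE;
        found = true;
    }
    return found;
}

int
main(void)
{
    int64_t t;

    lag_buffer_record(1000000, 0x1000);
    lag_buffer_record(1250000, 0x2000);

    if (lag_buffer_consume(0x1800, &t))
        printf("reply with primary timestamp %lld\n", (long long) t);
    /* The primary would then compute: lag = now_on_primary - t. */
    return 0;
}

Because the reported timestamp originated on the primary, the primary subtracts it from its own clock; the resulting figure includes one network hop but never a cross-host clock comparison.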
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > WAL replay for 2PC should also > call XLogRequestWalReceiverReply() when needed. Ah yes, I missed this important sentence. I will address that in the next version after some testing. -- Thomas Munro http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Mar 9, 2016 at 6:07 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Agreed. I have split this work up into four patches which apply on > top of each other, and provide something (hopefully) useful at each > stage. Yesterday's patch set doesn't apply after commit b6fb6471f6afaf649e52f38269fd8c5c60647669 which added a neighbouring line in pg_proc.h, so here's a new set that does. I looked into COMMIT PREPARED replay feedback and realised that it doesn't need any special handling beyond what is already in xact_redo_commit. However, I see now that I *do* need to do something when replaying PREPARE TRANSACTION, as you said. Not for causal reads though -- it doesn't care about an operation with no visible effect -- but for synchronous_commit = remote_apply. I am thinking about how to fix that. (Have PREPARE TRANSACTION wait only for flush even though you asked for remote_apply? Add a 'feedback please' bit to XLOG_XACT_PREPARE records? Always send feedback when replaying XLOG_XACT_PREPARE records?) The following rough ballpark numbers (generated with the attached test client) aren't very scientific or in any way indicative of real conditions (it's a bunch of clusters running on my laptop), but they demonstrate that two-phase commit apply feedback is being reported to the primary straight away in causal reads mode (otherwise the 2PC causal reads number wouldn't be so high).

Sequential UPDATE in simple transaction:
single node: ~2700 TPS
sync rep remote flush: ~2500 TPS
sync rep remote apply: ~2000 TPS
causal reads (4 standbys): ~1600 TPS

Sequential UPDATE in two phase commit transaction:
single node: ~900 TPS
sync rep remote flush: ~900 TPS
sync rep remote apply: (hangs)
causal reads (4 standbys): ~900 TPS

(The actual numbers are pretty noisy. I've taken medians of 3 and rounded to the nearest 100, and I guess the replication overheads are not magnified as much in the case of the slower 2PC workload and then get lost in the noise. With --check you can verify that the 2PC transaction is not always visible on the standby it connects to until you enable --causal-reads, so I don't think it's just broken!) -- Thomas Munro http://www.enterprisedb.com
Attachment
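The attached test client is not reproduced here; purely as an illustration of the kind of visibility check it performs, a minimal version against the stock libpq API might look like the following (connection strings and the table "t(id int primary key, v int)" are placeholders). Without causal_reads or remote_apply the final SELECT can legitimately miss the update; with them it should either see it or raise an error.

/*
 * Standalone sketch, not the attached test client: write on the
 * primary, then immediately read the same row on a standby and compare.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

static PGconn *
connect_or_die(const char *conninfo)
{
    PGconn *conn = PQconnectdb(conninfo);

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        exit(1);
    }
    return conn;
}

int
main(void)
{
    PGconn   *primary = connect_or_die("host=primary dbname=postgres");
    PGconn   *standby = connect_or_die("host=standby1 dbname=postgres");
    PGresult *res;
    char      expected[32];

    /* Write (and commit) on the primary. */
    res = PQexec(primary, "UPDATE t SET v = v + 1 WHERE id = 1 RETURNING v");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "update failed: %s", PQerrorMessage(primary));
        exit(1);
    }
    snprintf(expected, sizeof(expected), "%s", PQgetvalue(res, 0, 0));
    PQclear(res);

    /* Immediately read the same row on a standby. */
    res = PQexec(standby, "SELECT v FROM t WHERE id = 1");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "standby read failed: %s", PQerrorMessage(standby));
    else if (strcmp(PQgetvalue(res, 0, 0), expected) != 0)
        printf("stale read: saw %s, expected %s\n", PQgetvalue(res, 0, 0), expected);
    else
        printf("standby saw the committed value\n");
    PQclear(res);

    PQfinish(primary);
    PQfinish(standby);
    return 0;
}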
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Thu, Mar 10, 2016 at 12:35 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I looked into COMMIT PREPARED replay feedback and realised that it > doesn't need any special handling beyond what is already in > xact_redo_commit. However, I see now that I *do* need to do something > when replaying PREPARE TRANSACTION, as you said. Not for causal reads > though -- it doesn't care about an operation with no visible effect -- > but for synchronous_commit = remote_apply. I am thinking about how to > fix that. Ok, here is a new version with the following changes: 1. If you PREPARE TRANSACTION with synchronous_commit = remote_apply, it just waits for remote WAL flush. When you eventually COMMIT/ROLLBACK PREPARED, it waits for remote apply. Also, the XACT_COMPLETION_SYNC_APPLY_FEEDBACK bit is now set in appropriate abort records, and then handled in recovery, just as for commits, so that ROLLBACK and ROLLBACK PREPARED return at the right time in remote_apply. 2. I fixed a recently introduced stupid bug that caused causal reads to be broken when sync rep wasn't also configured (I had lost a change to stop SyncRepReleaseWaiters from leaving early, when I split up the patch... oops). 3. I switched the pg_stat_replication.replay_lag accounting from milliseconds to microseconds. The measured lag with fast machines on local networks can sometimes be sub-millisecond or very low numbers of milliseconds, so it's interesting to see more detail (depending on the primary's clock resolution). -- Thomas Munro http://www.enterprisedb.com
Attachment
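In client terms, the behaviour described in point 1 amounts to something like the libpq sequence below (the DSN and table are placeholders): with the patch as described, the PREPARE TRANSACTION step waits only for remote WAL flush, and only the final COMMIT PREPARED waits for remote apply.

/*
 * Standalone sketch: a two-phase commit driven with
 * synchronous_commit = remote_apply.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
        PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
        exit(1);
    }
    PQclear(res);
}

int
main(void)
{
    PGconn *conn = PQconnectdb("host=primary dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        exit(1);
    }

    run(conn, "SET synchronous_commit = remote_apply");
    run(conn, "BEGIN");
    run(conn, "UPDATE t SET v = v + 1 WHERE id = 1");
    run(conn, "PREPARE TRANSACTION 'demo_tx'");   /* waits for remote flush */
    run(conn, "COMMIT PREPARED 'demo_tx'");       /* waits for remote apply */

    PQfinish(conn);
    return 0;
}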
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Thu, Mar 10, 2016 at 8:35 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Mar 10, 2016 at 12:35 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> I looked into COMMIT PREPARED replay feedback and realised that it >> doesn't need any special handling beyond what is already in >> xact_redo_commit. However, I see now that I *do* need to do something >> when replaying PREPARE TRANSACTION, as you said. Not for causal reads >> though -- it doesn't care about an operation with no visible effect -- >> but for synchronous_commit = remote_apply. I am thinking about how to >> fix that. > > Ok, here is a new version with the following changes: > > 1. If you PREPARE TRANSACTION with synchronous_commit = remote_apply, > it just waits for remote WAL flush. When you eventually > COMMIT/ROLLBACK PREPARED, it waits for remote apply. Also, the > XACT_COMPLETION_SYNC_APPLY_FEEDBACK bit is now set in appropriate > abort records, and then handled in recovery, just as for commits, so > that ROLLBACK and ROLLBACK PREPARED return at the right time in > remote_apply. > > 2. I fixed a recently introduced stupid bug that caused causal reads > to be broken when sync rep wasn't also configured (I had lost a change > to stop SyncRepReleaseWaiters from leaving early, when I split up the > patch... oops). > > 3. I switched the pg_stat_replication.replay_lag accounting from > milliseconds to microseconds. The measured lag with fast machines on > local networks can sometimes be sub-millisecond or very low numbers of > milliseconds, so it's interesting to see more detail (depending on the > primary's clock resolution). The last patches I posted don't apply today due to changes in master, so here's a freshly merged patch series. -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Sun, Mar 13, 2016 at 11:50 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > The last patches I posted don't apply today due to changes in master, > so here's a freshly merged patch series. + from the current synchronous stanbyindicates it has received the Uh, no. - SyncRepWaitForLSN(gxact->prepare_end_lsn); + { + /* + * Don't wait for the prepare to be applied remotely in remote_apply + * mode, just wait for it to be flushed to the WAL. We will wait for + * apply when the transaction eventuallly commits or aborts. + */ + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL); + + SyncRepWaitForLSN(gxact->prepare_end_lsn); + + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL); + } What's with the extra block? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Mar 15, 2016 at 6:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Mar 13, 2016 at 11:50 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> The last patches I posted don't apply today due to changes in master, >> so here's a freshly merged patch series. > > + from the current synchronous stanbyindicates it has received the > > Uh, no. Oops, thanks, fixed. I'll wait for some more feedback or a conflict with master before sending a new version. > - SyncRepWaitForLSN(gxact->prepare_end_lsn); > + { > + /* > + * Don't wait for the prepare to be applied remotely in remote_apply > + * mode, just wait for it to be flushed to the WAL. We will wait for > + * apply when the transaction eventuallly commits or aborts. > + */ > + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) > + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL); > + > + SyncRepWaitForLSN(gxact->prepare_end_lsn); > + > + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) > + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL); > + } > > What's with the extra block? Yeah, that's silly, thanks. Tidied up for the next version. -- Thomas Munro http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Mon, Mar 14, 2016 at 2:38 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> What's with the extra block? > > Yeah, that's silly, thanks. Tidied up for the next version. Some more comments on 0001: + <literal>remote_write</>, <literal> remote_apply</>, <literal>local</>, and <literal>off</>. Extra space. + * apply when the transaction eventuallly commits or aborts. Spelling. + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL); + + SyncRepWaitForLSN(gxact->prepare_end_lsn); + + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL); You can't do this. Directly changing the value of a variable that is backing a GUC is verboten, and doing it through the thin veneer of calling the assign-hook will not avoid the terrible wrath of the powers that dwell in the outer dark, and/or Pittsburgh. You probably need a dance here similar to the way forcePageWrites/fullPageWrites work. /* + * Check if the caller would like to ask standbys for immediate feedback + * once this commit is applied. + */ Whitespace. + /* + * Check if the caller would like to ask standbys for immediate feedback + * once this abort is applied. + */ Whitespace again. /* + * doRequestWalReceiverReply is used by recovery code to ask the main recovery + * loop to trigger a walreceiver reply. + */ +static bool doRequestWalReceiverReply; This is the sort of comment that leads me to ask "why even bother writing a comment?". Try to say something that's not obvious from the variable name. The comment for XLogRequestWalReceiverReply has a similar issue. +static void WalRcvBlockSigUsr2(void) Style - newline after void. +static void WalRcvUnblockSigUsr2(void) And again here. + WalRcvUnblockSigUsr2(); len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); + WalRcvBlockSigUsr2(); This does not seem like it will be cheap on all operating systems. I think you should try to rejigger this somehow so that it can just set the process latch and the wal receiver figures it out from looking at shared memory. Like maybe a flag in WalRcvData? An advantage of this is that it should cut down on the number of signals significantly, because it won't need to send SIGUSR1 when the latch is already set. + * Although only "on", "off", "remote_apply", "remote_write", and "local" are + * documented, we accept all the likely variants of "on" and "off". Maybe switch to listing the undocumented values. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Mar 23, 2016 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 14, 2016 at 2:38 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >>> What's with the extra block? >> >> Yeah, that's silly, thanks. Tidied up for the next version. > > Some more comments on 0001: > > + <literal>remote_write</>, <literal> remote_apply</>, > <literal>local</>, and <literal>off</>. > > Extra space. Fixed. > + * apply when the transaction eventuallly commits or aborts. > > Spelling. Fixed. > + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) > + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL); > + > + SyncRepWaitForLSN(gxact->prepare_end_lsn); > + > + if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY) > + assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL); > > You can't do this. Directly changing the value of a variable that is > backing a GUC is verboten, and doing it through the thin veneer of > calling the assign-hook will not avoid the terrible wrath of the > powers that dwell in the outer dark, and/or Pittsburgh. You probably > need a dance here similar to the way forcePageWrites/fullPageWrites > work. Yeah, that was terrible. Instead of that I have now made this interface change: -SyncRepWaitForLSN(XLogRecPtr XactCommitLSN) +SyncRepWaitForLSN(XLogRecPtr lsn, bool commit) If you want to wait for non-commit records (prepare being the only case of that), you pass false in the second argument, and then the wait level is capped at remote flush inside that function. There is no way to wait for non-commit records to be applied and there is no point in making it so that you can (they don't have any user-visible effect). > /* > + * Check if the caller would like to ask standbys for immediate feedback > + * once this commit is applied. > + */ > > Whitespace. Fixed. > + /* > + * Check if the caller would like to ask standbys for immediate feedback > + * once this abort is applied. > + */ > > Whitespace again. Fixed. > /* > + * doRequestWalReceiverReply is used by recovery code to ask the main recovery > + * loop to trigger a walreceiver reply. > + */ > +static bool doRequestWalReceiverReply; > > This is the sort of comment that leads me to ask "why even bother > writing a comment?". Try to say something that's not obvious from the > variable name. The comment for XLogRequestWalReceiverReply has a > similar issue. Changed. > +static void WalRcvBlockSigUsr2(void) > > Style - newline after void. Fixed. > +static void WalRcvUnblockSigUsr2(void) > > And again here. Fixed. > + WalRcvUnblockSigUsr2(); > len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); > + WalRcvBlockSigUsr2(); > > This does not seem like it will be cheap on all operating systems. I > think you should try to rejigger this somehow so that it can just set > the process latch and the wal receiver figures it out from looking at > shared memory. Like maybe a flag in WalRcvData? An advantage of this > is that it should cut down on the number of signals significantly, > because it won't need to send SIGUSR1 when the latch is already set. Still experimenting with a latch here. I will come back on this point soon. > + * Although only "on", "off", "remote_apply", "remote_write", and "local" are > + * documented, we accept all the likely variants of "on" and "off". > > Maybe switch to listing the undocumented values. 
It follows a pattern used by several nearby bits of code, so it doesn't look like it should be different, and besides you can see what the undocumented values are: they're listed right below. Here are some test results run on a bunch of Amazon EC2 "m3.large" instances under Ubuntu Trusty in the Oregon zone, all in the same subnet. Defaults except 1GB shared_buffers.

1. Simple sequential updates (using 'test-causal-reads.c', already posted up-thread):

synchronous_commit    TPS
====================  ====
off                   9234
local                 1223
remote_write           907
on                     587
remote_apply           555

causal_reads          TPS
====================  ====
0 cr standbys         1112
1 cr standbys          541
2 cr standbys          487
3 cr standbys          467

2. Some pgbench -c4 -j2 -N bench2 runs:

synchronous_commit    TPS
====================  ====
off                   3937
local                 1984
remote_write          1701
on                    1373
remote_apply          1349

causal_reads          TPS
====================  ====
0 cr standbys         1973
1 cr standbys         1413
2 cr standbys         1282
3 cr standbys         1163

-- Thomas Munro http://www.enterprisedb.com
Attachment
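The capping behaviour of the revised SyncRepWaitForLSN(lsn, commit) interface can be modelled in a few lines; the sketch below is standalone with invented names, not the patch's function body: when the record being waited for is not a commit, the requested wait level is clamped to "remote flush", since waiting for apply of a record with no user-visible effect buys nothing.

/*
 * Standalone sketch of the wait-level capping idea.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum WaitLevel
{
    WAIT_LOCAL_FLUSH,
    WAIT_REMOTE_WRITE,
    WAIT_REMOTE_FLUSH,
    WAIT_REMOTE_APPLY
} WaitLevel;

static WaitLevel
effective_wait_level(WaitLevel requested, bool commit)
{
    if (!commit && requested > WAIT_REMOTE_FLUSH)
        return WAIT_REMOTE_FLUSH;
    return requested;
}

int
main(void)
{
    /* A PREPARE TRANSACTION record under remote_apply waits only for flush. */
    printf("prepare: %d\n", effective_wait_level(WAIT_REMOTE_APPLY, false));
    /* A commit record under remote_apply waits for apply. */
    printf("commit:  %d\n", effective_wait_level(WAIT_REMOTE_APPLY, true));
    return 0;
}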
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Wed, Mar 23, 2016 at 7:34 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > synchronous_commit TPS > ==================== ==== > off 9234 > local 1223 > remote_write 907 > on 587 > remote_apply 555 > > synchronous_commit TPS > ==================== ==== > off 3937 > local 1984 > remote_write 1701 > on 1373 > remote_apply 1349 Hmm, so "remote_apply" is barely more expensive than "on". That's encouraging. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Wed, Mar 23, 2016 at 8:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 23, 2016 at 7:34 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> synchronous_commit TPS >> ==================== ==== >> off 9234 >> local 1223 >> remote_write 907 >> on 587 >> remote_apply 555 >> >> synchronous_commit TPS >> ==================== ==== >> off 3937 >> local 1984 >> remote_write 1701 >> on 1373 >> remote_apply 1349 > > Hmm, so "remote_apply" is barely more expensive than "on". That's encouraging. Indeed, interesting. This is perhaps proving that just having the possibility to have remote_apply (with feedback messages managed by signals, which is the best thing proposed for this release) is what we need to ensure the consistency of reads across nodes, and what would satisfy most of the user's requirements. Getting a slightly lower TPS may be worth the cost for some users if they can ensure that reads across nodes are accessible after a local commit, and there is no need to have an error management layer at application level to take care of errors caused by causal read timeouts. -- Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Wed, Mar 23, 2016 at 8:43 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Mar 23, 2016 at 8:39 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Mar 23, 2016 at 7:34 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> synchronous_commit TPS >>> ==================== ==== >>> off 9234 >>> local 1223 >>> remote_write 907 >>> on 587 >>> remote_apply 555 >>> >>> synchronous_commit TPS >>> ==================== ==== >>> off 3937 >>> local 1984 >>> remote_write 1701 >>> on 1373 >>> remote_apply 1349 >> >> Hmm, so "remote_apply" is barely more expensive than "on". That's encouraging. > > Indeed, interesting. This is perhaps proving that just having the > possibility to have remote_apply (with feedback messages managed by > signals, which is the best thing proposed for this release) is what we > need to ensure the consistency of reads across nodes, and what would > satisfy most of the user's requirements. Getting a slightly lower TPS > may be worth the cost for some users if they can ensure that reads > across nodes are accessible after a local commit, and there is no need > to have an error management layer at application level to take care of > errors caused by causal read timeouts. Well, I wouldn't go that far. It seems pretty clear that remote_apply by itself is useful - I can't imagine anybody seriously arguing the contrary, whatever they think of this implementation. My view, though, is that by itself that's pretty limiting: you can only have one standby, and if that standby falls over then you lose availability. Causal reads fixes both of those problems - admittedly that requires some knowledge in the application or the pooler, but it's no worse than SSI in that regard. Still, half a loaf is better than none, and I imagine even just getting remote_apply would make a few people quite happy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
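The application- or pooler-side handling Robert alludes to can be as simple as trying the next candidate standby when one refuses the read. A hedged sketch follows, with placeholder host names and table; a real pooler would match the specific "standby is not available for causal reads" error rather than retrying on any failure, as this simplified loop does.

/*
 * Standalone sketch: route a causal read to the first standby that can
 * serve it, falling back to the next candidate on failure.
 */
#include <stdio.h>
#include <libpq-fe.h>

static const char *standbys[] = {
    "host=standby1 dbname=postgres",
    "host=standby2 dbname=postgres",
    "host=standby3 dbname=postgres",
};

int
main(void)
{
    for (int i = 0; i < 3; i++)
    {
        PGconn   *conn = PQconnectdb(standbys[i]);
        PGresult *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            continue;
        }

        res = PQexec(conn, "SET causal_reads = on");
        if (PQresultStatus(res) == PGRES_COMMAND_OK)
        {
            PQclear(res);
            res = PQexec(conn, "SELECT count(*) FROM messages");
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
            {
                printf("served by %s: %s rows\n", standbys[i], PQgetvalue(res, 0, 0));
                PQclear(res);
                PQfinish(conn);
                return 0;
            }
        }
        /* "standby is not available for causal reads" lands here: try the next one. */
        fprintf(stderr, "%s: %s", standbys[i], PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
    }
    fprintf(stderr, "no standby could satisfy the read\n");
    return 1;
}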
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Wed, Mar 23, 2016 at 11:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Well, I wouldn't go that far. It seems pretty clear that remote_apply > by itself is useful - I can't imagine anybody seriously arguing the > contrary, whatever they think of this implementation. My view, > though, is that by itself that's pretty limiting: you can only have > one standby, and if that standby falls over then you lose > availability. Causal reads fixes both of those problems - admittedly > that requires some knowledge in the application or the pooler, but > it's no worse than SSI in that regard. Still, half a loaf is better > than none, and I imagine even just getting remote_apply would make a > few people quite happy. OK, let's do so then: even if causal reads don't get into 9.6, users could take advantage of remote_apply on multiple nodes if the N-sync patch gets in. Just looking at 0001. - <literal>remote_write</>, <literal>local</>, and <literal>off</>. + <literal>remote_write</>, <literal>remote_apply</>, <literal>local</>, and <literal>off</>. The default, and safe, setting I imagine that a run of pgindent would be welcome for such large lines. +#define XactCompletionSyncApplyFeedback(xinfo) \ + (!!(xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK)) That's not directly something this patch should take care of, but the notation "!!" had better be avoided (see stdbool thread with VS2015). - SyncRepWaitForLSN(gxact->prepare_end_lsn); + SyncRepWaitForLSN(gxact->prepare_end_lsn, false); Isn't it important to ensure that a PREPARE LSN is applied as well on the standby with remote_apply? Say if an application prepares a transaction, it would commit locally but its LSN may not be applied on the standby with this patch. That would be a surprising behavior for the user. (not commenting on the latch and SIGUSR2 handling, you are still working on it per your last update). -- Michael
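For reference, the same test written with an explicit comparison sidesteps the "!!" concern; the bit value in this standalone sketch is a placeholder, not the patch's actual assignment.

/*
 * Standalone sketch: double negation vs. an explicit comparison.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XACT_COMPLETION_SYNC_APPLY_FEEDBACK  (1U << 5)   /* placeholder bit */

/* double-negation style */
#define XactCompletionSyncApplyFeedbackBang(xinfo) \
    (!!((xinfo) & XACT_COMPLETION_SYNC_APPLY_FEEDBACK))

/* explicit-comparison style */
#define XactCompletionSyncApplyFeedback(xinfo) \
    (((xinfo) & XACT_COMPLETION_SYNC_APPLY_FEEDBACK) != 0)

int
main(void)
{
    uint32_t xinfo = XACT_COMPLETION_SYNC_APPLY_FEEDBACK;
    bool     a = XactCompletionSyncApplyFeedbackBang(xinfo);
    bool     b = XactCompletionSyncApplyFeedback(xinfo);

    assert(a == b && a == true);
    return 0;
}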
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Thu, Mar 24, 2016 at 2:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Mar 23, 2016 at 11:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Well, I wouldn't go that far. It seems pretty clear that remote_apply >> by itself is useful - I can't imagine anybody seriously arguing the >> contrary, whatever they think of this implementation. My view, >> though, is that by itself that's pretty limiting: you can only have >> one standby, and if that standby falls over then you lose >> availability. Causal reads fixes both of those problems - admittedly >> that requires some knowledge in the application or the pooler, but >> it's no worse than SSI in that regard. Still, half a loaf is better >> than none, and I imagine even just getting remote_apply would make a >> few people quite happy. > > OK, let's do so then, even if causal reads don't get into 9.6 users > could get advantage of remote_apply on multiple nodes if the N-sync > patch gets in. > > Just looking at 0001. > > - <literal>remote_write</>, <literal>local</>, and <literal>off</>. > + <literal>remote_write</>, <literal>remote_apply</>, > <literal>local</>, and <literal>off</>. > The default, and safe, setting > I imagine that a run of pgindent would be welcome for such large lines. I didn't think pgindent touched the docs. But I agree lines over 80 characters should be wrapped if they're being modified anyway. > +#define XactCompletionSyncApplyFeedback(xinfo) \ > + (!!(xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK)) > That's not directly something this patch should take care of, but the > notation "!!" has better be avoided (see stdbool thread with VS2015). +1. > - SyncRepWaitForLSN(gxact->prepare_end_lsn); > + SyncRepWaitForLSN(gxact->prepare_end_lsn, false); > Isn't it important to ensure that a PREPARE LSN is applied as well on > the standby with remote_apply? Say if an application prepares a > transaction, it would commit locally but its LSN may not be applied on > the standby with this patch. That would be a surprising behavior for > the user. You need to wait for COMMIT PREPARED, but I don't see why you need to wait for PREPARE itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Thu, Mar 24, 2016 at 11:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 24, 2016 at 2:11 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> - SyncRepWaitForLSN(gxact->prepare_end_lsn); >> + SyncRepWaitForLSN(gxact->prepare_end_lsn, false); >> Isn't it important to ensure that a PREPARE LSN is applied as well on >> the standby with remote_apply? Say if an application prepares a >> transaction, it would commit locally but its LSN may not be applied on >> the standby with this patch. That would be a surprising behavior for >> the user. > > You need to wait for COMMIT PREPARED, but I don't see why you need to > wait for PREPARE itself. Multi-master conflict resolution. Knowing in time that a prepared transaction has been applied is useful on another node for lock conflict resolution. Say a PRIMARY KEY insert is prepared, and we want to know at the application level that its prepared state is visible everywhere. -- Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Thu, Mar 24, 2016 at 12:34 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Mar 23, 2016 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> +static void WalRcvUnblockSigUsr2(void) >> >> And again here. > > Fixed. > >> + WalRcvUnblockSigUsr2(); >> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); >> + WalRcvBlockSigUsr2(); >> >> This does not seem like it will be cheap on all operating systems. I >> think you should try to rejigger this somehow so that it can just set >> the process latch and the wal receiver figures it out from looking at >> shared memory. Like maybe a flag in WalRcvData? An advantage of this >> is that it should cut down on the number of signals significantly, >> because it won't need to send SIGUSR1 when the latch is already set. > > Still experimenting with a latch here. I will come back on this point soon. Here is a latch-based version. On Thu, Mar 24, 2016 at 7:11 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > Just looking at 0001. > > - <literal>remote_write</>, <literal>local</>, and <literal>off</>. > + <literal>remote_write</>, <literal>remote_apply</>, > <literal>local</>, and <literal>off</>. > The default, and safe, setting > I imagine that a run of pgindent would be welcome for such large lines. Fixed. > +#define XactCompletionSyncApplyFeedback(xinfo) \ > + (!!(xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK)) > That's not directly something this patch should take care of, but the > notation "!!" has better be avoided (see stdbool thread with VS2015). Changed. > - SyncRepWaitForLSN(gxact->prepare_end_lsn); > + SyncRepWaitForLSN(gxact->prepare_end_lsn, false); > Isn't it important to ensure that a PREPARE LSN is applied as well on > the standby with remote_apply? Say if an application prepares a > transaction, it would commit locally but its LSN may not be applied on > the standby with this patch. That would be a surprising behavior for > the user. My reasoning here was that this isn't a commit, so you shouldn't wait for it to be applied (just like we don't wait for any other non-committed stuff to be applied), because it has no user-visible effect other than the ability to COMMIT PREPARED on the standby if it is promoted after that point. For that reason I do wait for it to be flushed. After it is flushed, it is guaranteed to be applied some time before the recovery completes and a user could potentially run COMMIT PREPARED on the newly promoted master. -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Fri, Mar 25, 2016 at 4:51 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Mar 24, 2016 at 12:34 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Mar 23, 2016 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> +static void WalRcvUnblockSigUsr2(void) >>> >>> And again here. >> >> Fixed. >> >>> + WalRcvUnblockSigUsr2(); >>> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); >>> + WalRcvBlockSigUsr2(); >>> >>> This does not seem like it will be cheap on all operating systems. I >>> think you should try to rejigger this somehow so that it can just set >>> the process latch and the wal receiver figures it out from looking at >>> shared memory. Like maybe a flag in WalRcvData? An advantage of this >>> is that it should cut down on the number of signals significantly, >>> because it won't need to send SIGUSR1 when the latch is already set. >> >> Still experimenting with a latch here. I will come back on this point soon. > > Here is a latch-based version. Thanks for the updated version. This looks pretty nice. I find the routine name libpqrcv_wait a bit confusing. This is not a routine aimed at being exposed externally, as walrcv_send or walrcv_receive are. I would recommend changing the name to something like waitForWALStream or similar. Should we be worried about potential backward-incompatibilities with the new return values of walrcv_receive? Do you have numbers to share regarding how the latch-based approach performs compared to the approach that used SIGUSR2 when remote_apply is used? -- Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Sat, Mar 26, 2016 at 2:48 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Mar 25, 2016 at 4:51 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Thu, Mar 24, 2016 at 12:34 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> On Wed, Mar 23, 2016 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> +static void WalRcvUnblockSigUsr2(void) >>>> >>>> And again here. >>> >>> Fixed. >>> >>>> + WalRcvUnblockSigUsr2(); >>>> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf); >>>> + WalRcvBlockSigUsr2(); >>>> >>>> This does not seem like it will be cheap on all operating systems. I >>>> think you should try to rejigger this somehow so that it can just set >>>> the process latch and the wal receiver figures it out from looking at >>>> shared memory. Like maybe a flag in WalRcvData? An advantage of this >>>> is that it should cut down on the number of signals significantly, >>>> because it won't need to send SIGUSR1 when the latch is already set. >>> >>> Still experimenting with a latch here. I will come back on this point soon. >> >> Here is a latch-based version. > > Thanks for the updated version. This looks pretty nice. > > I find the routine name libpqrcv_wait to be a bit confusing. This is > not a routine aimed at being exposed externally as walrcv_send or > walrcv_receive. I would recommend changing the name, to something like > waitForWALStream or similar. Done (as "wait_for_wal_stream", following the case convention of nearby stuff). > Should we worried about potential backward-incompatibilities with the > new return values of walrcv_receive? There are three changes to the walrcv_receive interface: 1. It now takes a latch pointer, which may be NULL. 2. If the latch pointer is non-NULL, the existing function might return a new sentinel value WALRCV_RECEIVE_LATCH_SET. (The pre-existing sentinel value -1 is still in use and has the same value and meaning as before, but it now has a name: WALRCV_RECEIVE_COPY_ENDED.) 3. It will no longer return when the process is signalled (a latch should be used to ask it to return instead). Effectively, any client code would need to change at least to add NULL or possibly a latch if it needs to ask it to return, and any alternative implementation of the WAL receiver interface would need to use WaitEventSet (or WaitLatchOrSocket) as its event loop instead of whatever it might be using now so that it can detect a latch's state. But in any case, any such code would fail to compile against 9.6 due to the new argument, and then you'd only be able to get the new return value if you asked for it by passing in a latch. What affected code are we aware of -- either users of libpqwalreceiver.so or other WAL receiver implementations? > Do you have numbers to share regarding how is performing the > latch-based approach and the approach that used SIGUSR2 when > remote_apply is used? I couldn't measure any significant change (Linux, all on localhost, 128 cores): pgbench -c 1 -N bench2 -T 600 0001-remote-apply-v5.patch (signals), remote_apply -> 449 TPS 0001-remote-apply-v6.patch (latches), remote_apply -> 452 TPS pgbench -c 64 -j 32 -N bench2 -T 600 0001-remote-apply-v5.patch (signals), remote_apply -> 8536 TPS 0001-remote-apply-v6.patch (latches), remote_apply -> 8534 TPS Incidentally, I also did some testing on what happens when you signal a process that is busily writing and fsyncing. 
I tested a few different kernels, write sizes and disk latencies and saw that things were fine on all of them up to 10k signals/sec but after that some systems' fsync performance started to reduce. Only Linux on Power was still happily fsyncing at around 3/4 of full speed when signalled with a 2MHz+ tight kill loop (!), while FreeBSD and Linux on Intel weren't able to make much progress at all under such adversity. So I suppose that if you could get the commit rate up into 5 figures you might be able to measure an improvement for the latch version due to latch-collapsing, though I noticed a large amount of signal-collapsing going on at the OS level on all systems I tested anyway, so maybe it wouldn't make a difference. I attach that test program for interest. Also, I updated the comment for the declaration of the latch in walreceiver.h to say something about the new usage. New patch series attached. -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Sun, Mar 27, 2016 at 7:30 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Mar 26, 2016 at 2:48 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> Should we worried about potential backward-incompatibilities with the >> new return values of walrcv_receive? > > There are three changes to the walrcv_receive interface: > > 1. It now takes a latch pointer, which may be NULL. > > 2. If the latch pointer is non-NULL, the existing function might > return a new sentinel value WALRCV_RECEIVE_LATCH_SET. (The > pre-existing sentinel value -1 is still in use and has the same value > and meaning as before, but it now has a name: > WALRCV_RECEIVE_COPY_ENDED.) > > 3. It will no longer return when the process is signalled (a latch > should be used to ask it to return instead). > > Effectively, any client code would need to change at least to add NULL > or possibly a latch if it needs to ask it to return, and any > alternative implementation of the WAL receiver interface would need to > use WaitEventSet (or WaitLatchOrSocket) as its event loop instead of > whatever it might be using now so that it can detect a latch's state. > But in any case, any such code would fail to compile against 9.6 due > to the new argument, and then you'd only be able to get the new return > value if you asked for it by passing in a latch. What affected code > are we aware of -- either users of libpqwalreceiver.so or other WAL > receiver implementations? Any negative value returned by walrcv_receive would have stopped the replication stream, not only -1. And as you say, it's actually a good thing that the interface of walrcv_receive is changing with this patch, this way compilation would just fail and failures would not happen discreetly. That's too low-level to be mentioned in the release notes either way, so just a compilation failure is acceptable to me. >> Do you have numbers to share regarding how is performing the >> latch-based approach and the approach that used SIGUSR2 when >> remote_apply is used? > > I couldn't measure any significant change (Linux, all on localhost, 128 cores): > > pgbench -c 1 -N bench2 -T 600 > > 0001-remote-apply-v5.patch (signals), remote_apply -> 449 TPS > 0001-remote-apply-v6.patch (latches), remote_apply -> 452 TPS > > pgbench -c 64 -j 32 -N bench2 -T 600 > > 0001-remote-apply-v5.patch (signals), remote_apply -> 8536 TPS > 0001-remote-apply-v6.patch (latches), remote_apply -> 8534 TPS Which concludes that both imply the same kind of performance. We are likely seeing noise. > Incidentally, I also did some testing on what happens when you signal > a process that is busily writing and fsyncing. I tested a few > different kernels, write sizes and disk latencies and saw that things > were fine on all of them up to 10k signals/sec but after that some > systems' fsync performance started to reduce. Only Linux on Power was > still happily fsyncing at around 3/4 of full speed when signalled with > a 2MHz+ tight kill loop (!), while FreeBSD and Linux on Intel weren't > able to make much progress at all under such adversity. So I suppose > that if you could get the commit rate up into 5 figures you might be > able to measure an improvement for the latch version due to > latch-collapsing, though I noticed a large amount of signal-collapsing > going on at the OS level on all systems I tested anyway, so maybe it > wouldn't make a difference. I attach that test program for interest. Interesting results. 
> Also, I updated the comment for the declaration of the latch in > walreceiver.h to say something about the new usage. > > New patch series attached. static void WalRcvQuickDieHandler(SIGNAL_ARGS); -static voidProcessWalRcvInterrupts(void) Noise here. + ret = WaitLatchOrSocket(latch, events, PQsocket(streamConn), timeout_ms); + + if (ret & WL_POSTMASTER_DEATH) + exit(0); Exiting within libpqwalreceiver.so is no good. I think that even in the case of a postmaster cleanup we should allow things to be cleaned up. /* + * Wake up the walreceiver if it happens to be blocked in walrcv_receive, + * and tell it that a commit record has been applied. + * + * This is called by the startup process whenever interesting xlog records + * are applied, so that walreceiver can check if it needs to send an apply + * notification back to the master which may be waiting in a COMMIT with + * synchronous_commit = remote_apply. + */ +void +WalRcvWakeup(void) +{ + SetLatch(&WalRcv->latch); +} I think here that it would be good to add an assertion telling that this can just be called by the startup process while in recovery, WalRcv->latch is not protected by a mutex lock. +maximum of 'timeout' ms. If a message was successfully read, returns +its length. Otherwise returns 0 for timeout, WALRCV_RECEIVE_COPY_ENDED +for disconnection or WALRCV_RECEIVE_LATCH_SET. On success, a pointer Having an assigned constant name for timeout would be good for consistency with the rest. I have been also thinking a lot about this patch, and the fact that the WAL receiver latch is being used within the internals of libpqwalreceiver has been bugging me a lot, because this makes the wait phase happening within the libpqwalreceiver depend on something that only the WAL receiver had a only control on up to now (among the things thought: having a second latch for libpqwalreceiver, having an event interface for libpqwalreceiver, switch libpq_receive into being asynchronous...). At the end, we need a way to allow the startup process to let the WAL receiver process know that it needs to be interrupted via shared memory, and that's the WAL receiver latch, the removal of epoll stuff cleans up some code at the end. So it seems that I finally made my mind on 0001 and it looks good to me except the small things mentioned above. -- Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Mon, Mar 28, 2016 at 8:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > static void WalRcvQuickDieHandler(SIGNAL_ARGS); > > - > static void > ProcessWalRcvInterrupts(void) > Noise here. Fixed. > + ret = WaitLatchOrSocket(latch, events, PQsocket(streamConn), timeout_ms); > + > + if (ret & WL_POSTMASTER_DEATH) > + exit(0); > Exiting within libpqwalreceiver.so is no good. I think that even in > the case of a postmaster cleanup we should allow things to be cleaned > up. I agree. I suppose this is really a symptom of the problem you talked about below: see response there. > /* > + * Wake up the walreceiver if it happens to be blocked in walrcv_receive, > + * and tell it that a commit record has been applied. > + * > + * This is called by the startup process whenever interesting xlog records > + * are applied, so that walreceiver can check if it needs to send an apply > + * notification back to the master which may be waiting in a COMMIT with > + * synchronous_commit = remote_apply. > + */ > +void > +WalRcvWakeup(void) > +{ > + SetLatch(&WalRcv->latch); > +} > I think here that it would be good to add an assertion telling that > this can just be called by the startup process while in recovery, > WalRcv->latch is not protected by a mutex lock. > > +maximum of 'timeout' ms. If a message was successfully read, returns > +its length. Otherwise returns 0 for timeout, WALRCV_RECEIVE_COPY_ENDED > +for disconnection or WALRCV_RECEIVE_LATCH_SET. On success, a pointer > Having an assigned constant name for timeout would be good for > consistency with the rest. Yeah, I guess it would have to mirror all the WL_XXX flags if we continue down that path, but... > I have been also thinking a lot about this patch, and the fact that > the WAL receiver latch is being used within the internals of > libpqwalreceiver has been bugging me a lot, because this makes the > wait phase happening within the libpqwalreceiver depend on something > that only the WAL receiver had a only control on up to now (among the > things thought: having a second latch for libpqwalreceiver, having an > event interface for libpqwalreceiver, switch libpq_receive into being > asynchronous...). Yeah, it bugs me too. Do you prefer this? int walrcv_receive(char **buffer, int *wait_fd); Return value -1 means end-of-copy as before, return value 0 means "no data available now, please call me again when *wait_fd is ready to read". Then walreceiver.c can look after the WaitLatchOrSocket call and deal with socket readiness, postmaster death, timeout and latch, and libpqwalreceiver.c doesn't know anything about all that stuff anymore, but it is now part of the interface that it must expose a file descriptor for readiness testing when it doesn't have data available. Please find attached a new patch series which does it that way. > At the end, we need a way to allow the startup > process to let the WAL receiver process know that it needs to be > interrupted via shared memory, and that's the WAL receiver latch, the > removal of epoll stuff cleans up some code at the end. So it seems > that I finally made my mind on 0001 and it looks good to me except the > small things mentioned above. Thanks! -- Thomas Munro http://www.enterprisedb.com
Attachment
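To make the proposed contract concrete, here is a standalone model of how a caller might drive such an interface, with invented names and a pipe standing in for the replication socket; the real walreceiver would use WaitLatchOrSocket (folding in its latch, postmaster death and a timeout) rather than the bare poll() shown here.

/*
 * Standalone sketch: the receive function returns -1 for end-of-copy,
 * a positive length when a message is available, or 0 together with a
 * file descriptor the caller should wait on before calling again.
 */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static int pipe_fds[2];
static char msg_buf[64];

/* Stand-in for walrcv_receive(char **buffer, int *wait_fd). */
static int
fake_receive(char **buffer, int *wait_fd)
{
    ssize_t n = read(pipe_fds[0], msg_buf, sizeof(msg_buf) - 1);

    if (n > 0)
    {
        msg_buf[n] = '\0';
        *buffer = msg_buf;
        return (int) n;          /* a message is available */
    }
    *wait_fd = pipe_fds[0];      /* nothing yet: wait for readiness */
    return 0;
}

int
main(void)
{
    if (pipe(pipe_fds) != 0)
        return 1;
    /* Pretend the "primary" has already sent something. */
    write(pipe_fds[1], "hello", 5);

    for (;;)
    {
        char *buffer;
        int   wait_fd;
        int   len = fake_receive(&buffer, &wait_fd);

        if (len > 0)
        {
            printf("received %d bytes: %s\n", len, buffer);
            break;
        }
        if (len == 0)
        {
            /* The real caller would use WaitLatchOrSocket here. */
            struct pollfd pfd = {.fd = wait_fd, .events = POLLIN};

            poll(&pfd, 1, 1000);
        }
    }
    return 0;
}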
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Mar 29, 2016 at 1:56 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, Mar 28, 2016 at 8:54 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> I have been also thinking a lot about this patch, and the fact that >> the WAL receiver latch is being used within the internals of >> libpqwalreceiver has been bugging me a lot, because this makes the >> wait phase happening within the libpqwalreceiver depend on something >> that only the WAL receiver had a only control on up to now (among the >> things thought: having a second latch for libpqwalreceiver, having an >> event interface for libpqwalreceiver, switch libpq_receive into being >> asynchronous...). > > Yeah, it bugs me too. Do you prefer this? > > int walrcv_receive(char **buffer, int *wait_fd); > > Return value -1 means end-of-copy as before, return value 0 means "no > data available now, please call me again when *wait_fd is ready to > read". Then walreceiver.c can look after the WaitLatchOrSocket call > and deal with socket readiness, postmaster death, timeout and latch, > and libpqwalreceiver.c doesn't know anything about all that stuff > anymore, but it is now part of the interface that it must expose a > file descriptor for readiness testing when it doesn't have data > available. > > Please find attached a new patch series which does it that way. Oops, there is a bug in the primary disconnection case when len == 1 and it breaks out of the loop and wait_fd is invalid. I'll follow up on that tomorrow, but I'm interested to hear your thoughts (and anyone else's!) on that interface change and general approach. -- Thomas Munro http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Mon, Mar 28, 2016 at 10:08 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Tue, Mar 29, 2016 at 1:56 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Mon, Mar 28, 2016 at 8:54 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> I have been also thinking a lot about this patch, and the fact that >>> the WAL receiver latch is being used within the internals of >>> libpqwalreceiver has been bugging me a lot, because this makes the >>> wait phase happening within the libpqwalreceiver depend on something >>> that only the WAL receiver had a only control on up to now (among the >>> things thought: having a second latch for libpqwalreceiver, having an >>> event interface for libpqwalreceiver, switch libpq_receive into being >>> asynchronous...). >> >> Yeah, it bugs me too. Do you prefer this? >> >> int walrcv_receive(char **buffer, int *wait_fd); >> >> Return value -1 means end-of-copy as before, return value 0 means "no >> data available now, please call me again when *wait_fd is ready to >> read". Then walreceiver.c can look after the WaitLatchOrSocket call >> and deal with socket readiness, postmaster death, timeout and latch, >> and libpqwalreceiver.c doesn't know anything about all that stuff >> anymore, but it is now part of the interface that it must expose a >> file descriptor for readiness testing when it doesn't have data >> available. >> >> Please find attached a new patch series which does it that way. > > Oops, there is a bug in the primary disconnection case when len == 1 > and it breaks out of the loop and wait_fd is invalid. I'll follow up > on that tomorrow, but I'm interested to hear your thoughts (and anyone > else's!) on that interface change and general approach. I definitely prefer that, that's neater! libpq_select could be simplified because a timeout does not matter much. -- Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Mar 29, 2016 at 2:28 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Mar 28, 2016 at 10:08 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Tue, Mar 29, 2016 at 1:56 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> On Mon, Mar 28, 2016 at 8:54 PM, Michael Paquier >>> <michael.paquier@gmail.com> wrote: >>>> I have been also thinking a lot about this patch, and the fact that >>>> the WAL receiver latch is being used within the internals of >>>> libpqwalreceiver has been bugging me a lot, because this makes the >>>> wait phase happening within the libpqwalreceiver depend on something >>>> that only the WAL receiver had a only control on up to now (among the >>>> things thought: having a second latch for libpqwalreceiver, having an >>>> event interface for libpqwalreceiver, switch libpq_receive into being >>>> asynchronous...). >>> >>> Yeah, it bugs me too. Do you prefer this? >>> >>> int walrcv_receive(char **buffer, int *wait_fd); >>> >>> Return value -1 means end-of-copy as before, return value 0 means "no >>> data available now, please call me again when *wait_fd is ready to >>> read". Then walreceiver.c can look after the WaitLatchOrSocket call >>> and deal with socket readiness, postmaster death, timeout and latch, >>> and libpqwalreceiver.c doesn't know anything about all that stuff >>> anymore, but it is now part of the interface that it must expose a >>> file descriptor for readiness testing when it doesn't have data >>> available. >>> >>> Please find attached a new patch series which does it that way. >> >> Oops, there is a bug in the primary disconnection case when len == 1 >> and it breaks out of the loop and wait_fd is invalid. I'll follow up >> on that tomorrow, but I'm interested to hear your thoughts (and anyone >> else's!) on that interface change and general approach. > > I definitely prefer that, that's neater! libpq_select could be > simplified because a timeout does not matter much. Ok, here is a new version that exits the streaming loop correctly when endofwal becomes true. To hit that codepath you have to set up a cascading standby with recovery_target_timeline = 'latest', and then promote the standby it's talking to. I also got rid of the PostmasterIsAlive() check which became superfluous. You're right that libpq_select is now only ever called with timeout = -1 so could theoretically lose the parameter, but I decided against cluttering this patch up by touching that for now. It seems like the only reason it's used by libpqrcv_PQexec is something to do with interrupts on Windows, which I'm not able to test so that was another reason not to touch it. (BTW, isn't the select call in libpq_select lacking an exceptfds set, and can't it therefore block forever when there is an error condition on the socket and no timeout?) -- Thomas Munro http://www.enterprisedb.com
Attachment
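For anyone wanting to reproduce that end-of-WAL codepath, a hypothetical recovery.conf on the cascading standby could look like the following (host and user names are placeholders, not taken from the thread); promoting the middle standby it streams from then drives its walreceiver through the end-of-WAL exit described above.

    standby_mode = 'on'
    primary_conninfo = 'host=middle-standby port=5432 user=replication'
    recovery_target_timeline = 'latest'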
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Michael Paquier
Date:
On Tue, Mar 29, 2016 at 1:11 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Tue, Mar 29, 2016 at 2:28 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> I definitely prefer that, that's neater! libpq_select could be >> simplified because a timeout does not matter much. > > Ok, here is a new version that exits the streaming loop correctly when > endofwal becomes true. To hit that codepath you have to set up a > cascading standby with recovery_target_timeline = 'latest', and then > promote the standby it's talking to. I also got rid of the > PostmasterIsAlive() check which became superfluous. Yes, I can see the difference. > You're right that libpq_select is now only ever called with timeout = > -1 so could theoretically lose the parameter, but I decided against > cluttering this patch up by touching that for now. It seems like the > only reason it's used by libpqrcv_PQexec is something to do with > interrupts on Windows, which I'm not able to test so that was another > reason not to touch it. OK. I don't mind if the first patch is bare-bones. That's additional cleanup. > (BTW, isn't the select call in libpq_select > lacking an exceptfds set, and can't it therefore block forever when > there is an error condition on the socket and no timeout?) Hm. I think you're right here when timeout is NULL... It would loop infinitely. @Andres (in CC): your thoughts on that regarding the new WaitEventSetWaitBlock()? The same pattern is used there.

-bool walrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
-
-Retrieve any message available through the connection, blocking for

Oh, the description of walrcv_receive is actually incorrect in src/backend/replication/README from the beginning... I am sure you noticed that as well. Perhaps that's worth fixing in the back-branches (I think it does matter). Thoughts from others? OK, so I am switching this patch to "Ready for committer", for 0001. It is in better shape now. -- Michael
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Tue, Mar 29, 2016 at 3:17 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > OK, so I am switching this patch as "Ready for committer", for 0001. > It is in better shape now. Well... I have a few questions yet. The new argument to SyncRepWaitForLSN is called "bool commit", but RecordTransactionAbortPrepared passes true. Either it should be passing false, or the parameter is misnamed or at the least in need of a better comment. I don't understand why this patch is touching the abort paths at all. XactLogAbortRecord sets XACT_COMPLETION_SYNC_APPLY_FEEDBACK, and xact_redo_abort honors it. But surely it makes no sense to wait for an abort to become visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Mar 30, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 29, 2016 at 3:17 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> OK, so I am switching this patch as "Ready for committer", for 0001. >> It is in better shape now. > > Well... I have a few questions yet. > > The new argument to SyncRepWaitForLSN is called "bool commit", but > RecordTransactionAbortPrepared passes true. Either it should be > passing false, or the parameter is misnamed or at the least in need of > a better comment. > > I don't understand why this patch is touching the abort paths at all. > XactLogAbortRecord sets XACT_COMPLETION_SYNC_APPLY_FEEDBACK, and > xact_redo_abort honors it. But surely it makes no sense to wait for > an abort to become visible. You're right, that was totally unnecessary. Here is a version that removes that (ie XactLogAbortRecord doesn't request apply feedback from the standby, xact_redo_abort doesn't send apply feedback to the primary and RecordTransactionAbortPrepared now passes false to SyncRepWaitForLSN so it doesn't wait for apply feedback from the standby). Also I fixed a silly bug in SyncRepWaitForLSN when capping the mode. I have also renamed XACT_COMPLETION_SYNC_APPLY_FEEDBACK to the more general XACT_COMPLETION_APPLY_FEEDBACK, because the later 0004 patch will use it for a more general purpose than synchronous_commit. -- Thomas Munro http://www.enterprisedb.com
Attachment
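As a rough sketch of the mode-capping logic described in that fix (a hypothetical helper, not the patch's actual code): only a commit is allowed to wait for remote apply, since standbys only send apply feedback for commit records; anything else waits at most for remote flush.

    /*
     * Hypothetical helper: non-commit records never wait for remote apply,
     * because there is no apply feedback for them to wait for.
     */
    static int
    cap_syncrep_wait_mode(bool commit, int requested_mode)
    {
        if (!commit && requested_mode == SYNC_REP_WAIT_APPLY)
            return SYNC_REP_WAIT_FLUSH;
        return requested_mode;
    }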
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Tue, Mar 29, 2016 at 5:47 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Mar 30, 2016 at 6:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Mar 29, 2016 at 3:17 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> OK, so I am switching this patch as "Ready for committer", for 0001. >>> It is in better shape now. >> >> Well... I have a few questions yet. >> >> The new argument to SyncRepWaitForLSN is called "bool commit", but >> RecordTransactionAbortPrepared passes true. Either it should be >> passing false, or the parameter is misnamed or at the least in need of >> a better comment. >> >> I don't understand why this patch is touching the abort paths at all. >> XactLogAbortRecord sets XACT_COMPLETION_SYNC_APPLY_FEEDBACK, and >> xact_redo_abort honors it. But surely it makes no sense to wait for >> an abort to become visible. > > You're right, that was totally unnecessary. Here is a version that > removes that (ie XactLogAbortRecord doesn't request apply feedback > from the standby, xact_redo_abort doesn't send apply feedback to the > primary and RecordTransactionAbortPrepared now passes false to > SyncRepWaitForLSN so it doesn't wait for apply feedback from the > standby). Also I fixed a silly bug in SyncRepWaitForLSN when capping > the mode. I have also renamed XACT_COMPLETION_SYNC_APPLY_FEEDBACK to > the more general XACT_COMPLETION_APPLY_FEEDBACK, because the later > 0004 patch will use it for a more general purpose than > synchronous_commit. OK, I committed this, with a few tweaks. In particular, I added a flag variable instead of relying on "latch set" == "need to send reply"; the other changes were cosmetic. I'm not sure how much more of this we can realistically get into 9.6; the latter patches haven't had much review yet. But I'll set this back to Needs Review in the CommitFest and we'll see where we end up. But even if we don't get anything more than this, it's still rather nice: remote_apply turns out to be only slightly slower than remote flush, and it's a guarantee that a lot of people are looking for. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote: > OK, I committed this, with a few tweaks. In particular, I added a > flag variable instead of relying on "latch set" == "need to send > reply"; the other changes were cosmetic. > > I'm not sure how much more of this we can realistically get into 9.6; > the latter patches haven't had much review yet. But I'll set this > back to Needs Review in the CommitFest and we'll see where we end up. > But even if we don't get anything more than this, it's still rather > nice: remote_apply turns out to be only slightly slower than remote > flush, and it's a guarantee that a lot of people are looking for.

Thank you Michael and Robert!

Please find attached the rest of the patch series, rebased against master. The goal of the 0002 patch is to provide an accurate indication of the current replay lag on each standby, visible to users like this:

postgres=# select application_name, replay_lag from pg_stat_replication;
 application_name │   replay_lag
──────────────────┼─────────────────
 replica1         │ 00:00:00.000299
 replica2         │ 00:00:00.000323
 replica3         │ 00:00:00.000319
 replica4         │ 00:00:00.000303
(4 rows)

It works by maintaining a buffer of (end of WAL, time now) samples received from the primary, and then eventually feeding those times back to the primary when the recovery process replays the corresponding locations.

Compared to approaches based on commit timestamps, this approach has the advantage of providing non-misleading information between commits. For example, if you run a batch load job that takes 1 minute to insert the whole phonebook and no other transactions run, you will see replay_lag updating regularly throughout that minute, whereas typical commit timestamp-only approaches will show an increasing lag time until a commit record is eventually applied. Compared to simple LSN location comparisons, it reports in time rather than bytes of WAL, which can be more meaningful for DBAs.

When the standby is entirely caught up and there is no write activity, the reported time effectively represents the ping time between the servers, and is updated every wal_sender_timeout / 2, when keepalive messages are sent. While new WAL traffic is arriving, the walreceiver records timestamps at most once per second in a circular buffer, and then sends back replies containing the recorded timestamps as fast as the recovery process can apply the corresponding xlog. The lag number you see is computed by the primary server comparing two timestamps generated by its own system clock, one of which has been on a journey to the standby and back.

Accurate lag estimates are a prerequisite for the 0004 patch (about which more later), but I believe users would find this valuable as a feature on its own.

-- Thomas Munro http://www.enterprisedb.com
Attachment
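A minimal sketch of the sampling scheme described in that message, using illustrative names (LagSample, lag_buffer, record_lag_sample, replayed_sample_time) rather than the patch's actual identifiers; XLogRecPtr and TimestampTz are the usual PostgreSQL types.

    #define LAG_BUFFER_SIZE 8192

    typedef struct LagSample
    {
        XLogRecPtr  lsn;        /* end of WAL as reported by the primary */
        TimestampTz time;       /* the primary's clock when it sent that WAL */
    } LagSample;

    static LagSample lag_buffer[LAG_BUFFER_SIZE];
    static int  lag_head = 0;   /* next slot to write */
    static int  lag_tail = 0;   /* oldest sample not yet replayed */

    /* Called by the walreceiver, at most about once per second. */
    static void
    record_lag_sample(XLogRecPtr end_of_wal, TimestampTz primary_now)
    {
        int     next = (lag_head + 1) % LAG_BUFFER_SIZE;

        if (next == lag_tail)
            lag_tail = (lag_tail + 1) % LAG_BUFFER_SIZE;    /* full: drop oldest */
        lag_buffer[lag_head].lsn = end_of_wal;
        lag_buffer[lag_head].time = primary_now;
        lag_head = next;
    }

    /*
     * Called when building a reply to the primary: return the newest sampled
     * timestamp whose LSN has already been replayed, or 0 if there is nothing
     * new to report.
     */
    static TimestampTz
    replayed_sample_time(XLogRecPtr replayed_upto)
    {
        TimestampTz result = 0;

        while (lag_tail != lag_head && lag_buffer[lag_tail].lsn <= replayed_upto)
        {
            result = lag_buffer[lag_tail].time;
            lag_tail = (lag_tail + 1) % LAG_BUFFER_SIZE;
        }
        return result;
    }

The key property is that the timestamp makes the round trip unmodified, so the primary only ever compares two readings of its own clock and never has to reconcile clocks across machines.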
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> OK, I committed this, with a few tweaks. In particular, I added a >> flag variable instead of relying on "latch set" == "need to send >> reply"; the other changes were cosmetic. >> >> I'm not sure how much more of this we can realistically get into 9.6; >> the latter patches haven't had much review yet. But I'll set this >> back to Needs Review in the CommitFest and we'll see where we end up. >> But even if we don't get anything more than this, it's still rather >> nice: remote_apply turns out to be only slightly slower than remote >> flush, and it's a guarantee that a lot of people are looking for. > > Thank you Michael and Robert! > > Please find attached the rest of the patch series, rebased against > master. The goal of the 0002 patch is to provide an accurate > indication of the current replay lag on each standby, visible to users > like this: > > postgres=# select application_name, replay_lag from pg_stat_replication; > application_name │ replay_lag > ──────────────────┼───────────────── > replica1 │ 00:00:00.000299 > replica2 │ 00:00:00.000323 > replica3 │ 00:00:00.000319 > replica4 │ 00:00:00.000303 > (4 rows) > > It works by maintaining a buffer of (end of WAL, time now) samples > received from the primary, and then eventually feeding those times > back to the primary when the recovery process replays the > corresponding locations. > > Compared to approaches based on commit timestamps, this approach has > the advantage of providing non-misleading information between commits. > For example, if you run a batch load job that takes 1 minute to insert > the whole phonebook and no other transactions run, you will see > replay_lag updating regularly throughout that minute, whereas typical > commit timestamp-only approaches will show an increasing lag time > until a commit record is eventually applied. Compared to simple LSN > location comparisons, it reports in time rather than bytes of WAL, > which can be more meaningful for DBAs. > > When the standby is entirely caught up and there is no write activity, > the reported time effectively represents the ping time between the > servers, and is updated every wal_sender_timeout / 2, when keepalive > messages are sent. While new WAL traffic is arriving, the walreceiver > records timestamps at most once per second in a circular buffer, and > then sends back replies containing the recorded timestamps as fast as > the recovery process can apply the corresponding xlog. The lag number > you see is computed by the primary server comparing two timestamps > generated by its own system clock, one of which has been on a journey > to the standby and back. > > Accurate lag estimates are a prerequisite for the 0004 patch (about > which more later), but I believe users would find this valuable as a > feature on its own. Well, one problem with this is that you can't put a loop inside of a spinlock-protected critical section. In general, I think this is a pretty reasonable way of attacking this problem, but I'd say it's significantly under-commented. Where should someone go to get a general overview of this mechanism? The answer is not "at place XXX within the patch". (I think it might merit some more extensive documentation, too, although I'm not exactly sure what that should look like.) 
When you overflow the buffer, you could thin it out in a smarter way, like by throwing away every other entry instead of the oldest one. I guess you'd need to be careful how you coded that, though, because replaying an entry with a timestamp invalidates some of the saved entries without formally throwing them out. Conceivably, 0002 could be split into two patches, one of which computes "stupid replay lag" considering only records that naturally carry timestamps, and a second adding the circular buffer to handle the case where much time passes without finding such a record. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
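To make that suggestion concrete, a thinning pass over a circular buffer like the hypothetical one sketched after the 0002 description above might look like this (still illustrative names, not the patch's code).

    /*
     * Sketch: when the sample buffer fills up, keep only every other entry so
     * the surviving samples still span the whole interval, at half resolution.
     * Assumes the lag_buffer/lag_head/lag_tail layout sketched earlier.
     */
    static void
    thin_lag_buffer(void)
    {
        int     from = lag_tail;
        int     to = lag_tail;
        int     i = 0;

        while (from != lag_head)
        {
            if (i++ % 2 == 0)   /* keep samples 0, 2, 4, ... */
            {
                lag_buffer[to] = lag_buffer[from];
                to = (to + 1) % LAG_BUFFER_SIZE;
            }
            from = (from + 1) % LAG_BUFFER_SIZE;
        }
        lag_head = to;
    }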
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Apr 5, 2016 at 4:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> OK, I committed this, with a few tweaks. In particular, I added a >>> flag variable instead of relying on "latch set" == "need to send >>> reply"; the other changes were cosmetic. >>> >>> I'm not sure how much more of this we can realistically get into 9.6; >>> the latter patches haven't had much review yet. But I'll set this >>> back to Needs Review in the CommitFest and we'll see where we end up. >>> But even if we don't get anything more than this, it's still rather >>> nice: remote_apply turns out to be only slightly slower than remote >>> flush, and it's a guarantee that a lot of people are looking for. >> >> Thank you Michael and Robert! >> >> Please find attached the rest of the patch series, rebased against >> master. The goal of the 0002 patch is to provide an accurate >> indication of the current replay lag on each standby, visible to users >> like this: >> >> postgres=# select application_name, replay_lag from pg_stat_replication; >> application_name │ replay_lag >> ──────────────────┼───────────────── >> replica1 │ 00:00:00.000299 >> replica2 │ 00:00:00.000323 >> replica3 │ 00:00:00.000319 >> replica4 │ 00:00:00.000303 >> (4 rows) >> >> It works by maintaining a buffer of (end of WAL, time now) samples >> received from the primary, and then eventually feeding those times >> back to the primary when the recovery process replays the >> corresponding locations. >> >> Compared to approaches based on commit timestamps, this approach has >> the advantage of providing non-misleading information between commits. >> For example, if you run a batch load job that takes 1 minute to insert >> the whole phonebook and no other transactions run, you will see >> replay_lag updating regularly throughout that minute, whereas typical >> commit timestamp-only approaches will show an increasing lag time >> until a commit record is eventually applied. Compared to simple LSN >> location comparisons, it reports in time rather than bytes of WAL, >> which can be more meaningful for DBAs. >> >> When the standby is entirely caught up and there is no write activity, >> the reported time effectively represents the ping time between the >> servers, and is updated every wal_sender_timeout / 2, when keepalive >> messages are sent. While new WAL traffic is arriving, the walreceiver >> records timestamps at most once per second in a circular buffer, and >> then sends back replies containing the recorded timestamps as fast as >> the recovery process can apply the corresponding xlog. The lag number >> you see is computed by the primary server comparing two timestamps >> generated by its own system clock, one of which has been on a journey >> to the standby and back. >> >> Accurate lag estimates are a prerequisite for the 0004 patch (about >> which more later), but I believe users would find this valuable as a >> feature on its own. > > Well, one problem with this is that you can't put a loop inside of a > spinlock-protected critical section. > > In general, I think this is a pretty reasonable way of attacking this > problem, but I'd say it's significantly under-commented. Where should > someone go to get a general overview of this mechanism? The answer is > not "at place XXX within the patch". 
(I think it might merit some > more extensive documentation, too, although I'm not exactly sure what > that should look like.) > > When you overflow the buffer, you could thin it out in a smarter way, > like by throwing away every other entry instead of the oldest one. I > guess you'd need to be careful how you coded that, though, because > replaying an entry with a timestamp invalidates some of the saved > entries without formally throwing them out. > > Conceivably, 0002 could be split into two patches, one of which > computes "stupid replay lag" considering only records that naturally > carry timestamps, and a second adding the circular buffer to handle > the case where much time passes without finding such a record. Thanks. I see a way to move that loop and change the overflow behaviour along those lines but due to other commitments I won't be able to post a well-tested patch and still leave time for reviewers and committer before the looming deadline. After the freeze I will post an updated version that addresses these problems for the next CF. -- Thomas Munro http://www.enterprisedb.com
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Robert Haas
Date:
On Tue, Apr 5, 2016 at 7:21 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Thanks. I see a way to move that loop and change the overflow > behaviour along those lines but due to other commitments I won't be > able to post a well tested patch and still leave time for reviewers > and committer before the looming deadline. After the freeze I will > post an updated version that addresses these problems for the next CF. OK, I've marked this Returned with Feedback for now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Thomas Munro
Date:
On Tue, Mar 29, 2016 at 8:17 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Mar 29, 2016 at 1:11 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> (BTW, isn't the select call in libpq_select >> lacking an exceptfds set, and can't it therefore block forever when >> there is an error condition on the socket and no timeout?) > > Hm. I think you're right here when timeout is NULL... It would loop infinitely. > @Andres (in CC): your thoughts on that regarding the new > WaitEventSetWaitBlock()? The same pattern is used there. That was a red herring. I was confused because SUSv2 and POSIX call this argument 'errorfds' and say that sockets *also* tell you about errors this way. (Many/most real OSs call the argument 'exceptfds' instead and only use it to tell you about out-of-band data and possibly implementation specific events for devices, pseudo-terminals etc. If you want to know about errors on a socket it's enough to have it in readfds/writefds, and insufficient to have it only in errorfds/exceptfds unless you can find a computer that actually conforms to POSIX.) -- Thomas Munro http://www.enterprisedb.com
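As a minimal illustration of that point (a sketch, not the libpq_select code itself): it is enough to put the socket in readfds, since select() reports it readable on error or hangup and the following read or PQconsumeInput() surfaces the actual error.

    #include <sys/select.h>

    /*
     * Sketch: waiting for readability alone also wakes us up on socket errors
     * and hangups, so an exceptfds/errorfds set is not needed for error
     * detection.
     */
    static int
    wait_for_socket_readable(int sock)
    {
        fd_set  readfds;

        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);

        /* NULL timeout: block until the socket becomes readable */
        return select(sock + 1, &readfds, NULL, NULL, NULL);
    }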
Re: Proposal: "Causal reads" mode for load balancing reads without stale data
From
Andres Freund
Date:
On 2016-05-05 13:30:42 +1200, Thomas Munro wrote: > That was a red herring. I was confused because SUSv2 and POSIX call > this argument 'errorfds' and say that sockets *also* tell you about > errors this way. (Many/most real OSs call the argument 'exceptfds' > instead and only use it to tell you about out-of-band data and > possibly implementation specific events for devices, pseudo-terminals > etc. If you want to know about errors on a socket it's enough to have > it in readfds/writefds, and insufficient to have it only in > errorfds/exceptfds unless you can find a computer that actually > conforms to POSIX.) Correct, exceptfds is pretty much meaningless for anything we do in postgres. We rely on select returning a socket as read/writeable if the socket has hung up. That's been the case *before* the recent WaitEventSet refactoring, so I think we're fairly solid on relying on that. Greetings, Andres Freund