Re: Proposal: "Causal reads" mode for load balancing reads without stale data - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date
Msg-id CAEepm=31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w@mail.gmail.com
In response to Re: Proposal: "Causal reads" mode for load balancing reads without stale data  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Proposal: "Causal reads" mode for load balancing reads without stale data  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Apr 5, 2016 at 4:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> OK, I committed this, with a few tweaks.  In particular, I added a
>>> flag variable instead of relying on "latch set" == "need to send
>>> reply"; the other changes were cosmetic.
>>>
>>> I'm not sure how much more of this we can realistically get into 9.6;
>>> the latter patches haven't had much review yet.  But I'll set this
>>> back to Needs Review in the CommitFest and we'll see where we end up.
>>> But even if we don't get anything more than this, it's still rather
>>> nice: remote_apply turns out to be only slightly slower than remote
>>> flush, and it's a guarantee that a lot of people are looking for.
>>
>> Thank you Michael and Robert!
>>
>> Please find attached the rest of the patch series, rebased against
>> master.  The goal of the 0002 patch is to provide an accurate
>> indication of the current replay lag on each standby, visible to users
>> like this:
>>
>> postgres=# select application_name, replay_lag from pg_stat_replication;
>>  application_name │   replay_lag
>> ──────────────────┼─────────────────
>>  replica1         │ 00:00:00.000299
>>  replica2         │ 00:00:00.000323
>>  replica3         │ 00:00:00.000319
>>  replica4         │ 00:00:00.000303
>> (4 rows)
>>
>> It works by maintaining a buffer of (end of WAL, time now) samples
>> received from the primary, and then eventually feeding those times
>> back to the primary when the recovery process replays the
>> corresponding locations.
>>
>> Compared to approaches based on commit timestamps, this approach has
>> the advantage of providing non-misleading information between commits.
>> For example, if you run a batch load job that takes 1 minute to insert
>> the whole phonebook and no other transactions run, you will see
>> replay_lag updating regularly throughout that minute, whereas typical
>> commit timestamp-only approaches will show an increasing lag time
>> until a commit record is eventually applied.  Compared to simple LSN
>> location comparisons, it reports in time rather than bytes of WAL,
>> which can be more meaningful for DBAs.
>>
>> When the standby is entirely caught up and there is no write activity,
>> the reported time effectively represents the ping time between the
>> servers, and is updated every wal_sender_timeout / 2, when keepalive
>> messages are sent.  While new WAL traffic is arriving, the walreceiver
>> records timestamps at most once per second in a circular buffer, and
>> then sends back replies containing the recorded timestamps as fast as
>> the recovery process can apply the corresponding xlog.  The lag number
>> you see is computed by the primary server comparing two timestamps
>> generated by its own system clock, one of which has been on a journey
>> to the standby and back.
>>
>> Accurate lag estimates are a prerequisite for the 0004 patch (about
>> which more later), but I believe users would find this valuable as a
>> feature on its own.
>
> Well, one problem with this is that you can't put a loop inside of a
> spinlock-protected critical section.
>
> In general, I think this is a pretty reasonable way of attacking this
> problem, but I'd say it's significantly under-commented.  Where should
> someone go to get a general overview of this mechanism?  The answer is
> not "at place XXX within the patch".  (I think it might merit some
> more extensive documentation, too, although I'm not exactly sure what
> that should look like.)
>
> When you overflow the buffer, you could thin it out in a smarter way,
> like by throwing away every other entry instead of the oldest one.  I
> guess you'd need to be careful how you coded that, though, because
> replaying an entry with a timestamp invalidates some of the saved
> entries without formally throwing them out.
>
> Conceivably, 0002 could be split into two patches, one of which
> computes "stupid replay lag" considering only records that naturally
> carry timestamps, and a second adding the circular buffer to handle
> the case where much time passes without finding such a record.

Thanks.  I see a way to move that loop and change the overflow
behaviour along those lines, but due to other commitments I won't be
able to post a well-tested patch and still leave time for reviewers
and a committer before the looming deadline.  After the freeze I will
post an updated version that addresses these problems for the next CF.
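
For anyone following along, here is a rough C sketch of the sampling
scheme described above, purely as illustration: it is not code from the
patch, and the names are invented.  The idea is that the walreceiver
records (end-of-WAL, timestamp) pairs in a small ring buffer, and the
recovery side hands each timestamp back for the next reply once replay
has passed the corresponding LSN.

#include "postgres.h"
#include "access/xlogdefs.h"        /* XLogRecPtr */
#include "datatype/timestamp.h"     /* TimestampTz */

#define LAG_SAMPLE_COUNT 1024       /* hypothetical buffer size */

typedef struct LagSample
{
    XLogRecPtr  lsn;        /* end of WAL when the sample was taken */
    TimestampTz time;       /* primary's clock at that moment */
} LagSample;

/*
 * Plain statics keep the sketch short; a real implementation would need
 * these visible to both the walreceiver and the startup process.
 */
static LagSample lag_samples[LAG_SAMPLE_COUNT];
static int  lag_head = 0;   /* next slot to write */
static int  lag_tail = 0;   /* oldest sample not yet reported */

/* Walreceiver side: record a sample, at most once per second. */
static void
lag_record_sample(XLogRecPtr lsn, TimestampTz now)
{
    lag_samples[lag_head].lsn = lsn;
    lag_samples[lag_head].time = now;
    lag_head = (lag_head + 1) % LAG_SAMPLE_COUNT;
    if (lag_head == lag_tail)       /* full: drop the oldest sample */
        lag_tail = (lag_tail + 1) % LAG_SAMPLE_COUNT;
}

/*
 * Recovery side: given the latest replayed LSN, return the newest
 * recorded timestamp whose LSN has now been applied, to be sent back to
 * the primary in the next reply.  Returns 0 (standing in for "no sample"
 * in this sketch) if there is nothing new to report.
 */
static TimestampTz
lag_consume_samples(XLogRecPtr replayed_upto)
{
    TimestampTz result = 0;

    while (lag_tail != lag_head &&
           lag_samples[lag_tail].lsn <= replayed_upto)
    {
        result = lag_samples[lag_tail].time;
        lag_tail = (lag_tail + 1) % LAG_SAMPLE_COUNT;
    }
    return result;
}

The primary then compares the returned timestamp with its own clock
when the reply arrives, which is where the replay_lag interval shown
above comes from.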

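The thinning idea could be bolted onto the same sketch roughly like
this, leaving aside for now the caveat about entries that replay has
already invalidated (again hypothetical code, not from the patch):

/*
 * Called by lag_record_sample() when the buffer is full, instead of
 * overwriting the oldest entry: keep every second sample, preserving
 * order, so we halve the sample density but still cover the same span
 * of WAL.
 */
static void
lag_compact_samples(void)
{
    int     nsamples = (lag_head - lag_tail + LAG_SAMPLE_COUNT) %
                       LAG_SAMPLE_COUNT;
    int     kept = 0;
    int     i;

    for (i = 0; i < nsamples; i += 2)
    {
        lag_samples[(lag_tail + kept) % LAG_SAMPLE_COUNT] =
            lag_samples[(lag_tail + i) % LAG_SAMPLE_COUNT];
        kept++;
    }
    lag_head = (lag_tail + kept) % LAG_SAMPLE_COUNT;
}
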
--
Thomas Munro
http://www.enterprisedb.com


