Re: Detecting skipped data from logical slots (data silently skipped) - Mailing list pgsql-hackers

From: Craig Ringer
Subject: Re: Detecting skipped data from logical slots (data silently skipped)
Msg-id: CAMsr+YGzr77tDd=hb930KMyJx2PCYJ6gXoNFU1R8JQv_OFPLag@mail.gmail.com
In response to: Re: Detecting skipped data from logical slots (data silently skipped) (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 17 August 2016 at 05:18, Andres Freund <andres@anarazel.de> wrote:
On 2016-08-08 10:59:20 +0800, Craig Ringer wrote:
> Right. Though if we flush lazily I'm surprised the effect is that big,
> you're the one who did the work and knows the significance of it.

It will be. Either you're increasing bloat (by not increasing the
slot's wal position / catalog xmin), or you're adding frequent syncs on
an idle connection.


My thinking is that we should be able to do this lazily, as we already do with feedback during apply of changes. The problem is that right now we can't tell the difference between confirmed_flush_lsn advances made in response to keepalives when there's no interesting upstream activity, and advances made when the client replays and confirms real activity of interest. So we can add a new field to logical slots that tracks the last confirmed_flush_lsn update that resulted from an actual write to the client rather than from a keepalive response. No new resource retention is required, no new client messages, no new protocol fields. Just one new field in a logical slot.
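
To make that concrete, here's a minimal sketch of the slot-side state this would need. The type and field names below are my illustration, not existing PostgreSQL definitions; in reality the new field would sit alongside confirmed_flush in the slot's persistent data.

/*
 * Sketch only: the names here are illustrative, not the actual
 * ReplicationSlotPersistentData definition.
 */
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's XLogRecPtr */

typedef struct LogicalSlotProgress
{
    XLogRecPtr  confirmed_flush_lsn;      /* advanced by any client feedback,
                                           * including keepalive replies */
    XLogRecPtr  last_write_lsn;           /* confirmed_flush_lsn as of the last
                                           * time the output plugin sent real
                                           * data to the client */
    XLogRecPtr  candidate_last_write_lsn; /* pending value, persisted lazily
                                           * once feedback confirms the client
                                           * flushed the corresponding writes */
} LogicalSlotProgress;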



* Add a new field, say last_write_lsn, to slots. A logical slot updates this whenever an output plugin sends something to the client in response to a callback. last_write_lsn is not advanced along with confirmed_flush_lsn when we just skip over data that's not of interest, like writes to other DBs or changes the output plugin filters out; it advances only when the output plugin actually sends something to the client.

* A candidate_last_write_lsn type mechanism is needed to ensure we don't flush out advances of last_write_lsn before we've got client feedback confirming it flushed the changes resulting from the output plugin writes. The same sort of logic used for candidate_restart_lsn and restart_lsn will work fine, but we don't have to make sure it's flushed like we do with restart_lsn; we can just dirty the slot and wait for the next slot checkpoint. It's pretty harmless if the persisted value is older than reality; it just adds a small window where we won't detect lost changes.

* Clients like BDR and pglogical already send feedback lazily. They track the server's flush position and send feedback for an upstream LSN once they know the corresponding downstream writes and associated replication origin advances have been flushed to disk (as you know, having written it). Behaviour during normal apply doesn't need to change. Neither does behaviour while idle; clients don't have to advance their replication origin in response to server keepalives, though they may do so lazily.

* When a client starts a new decoding session we check last_write_lsn against the LSN the client requests from its replication origin. We ERROR if last_write_lsn is newer than the requested LSN, indicating that the client is trying to replay changes it, or someone else using the same slot, has already seen and confirmed. (A rough sketch of this check follows the list.)

* catalog_xmin advances and WAL removal are NOT limited by last_write_lsn; we can freely remove WAL after last_write_lsn and vacuum catalogs. On reconnect we continue to skip forward to confirmed_flush_lsn if asked for an older LSN, just like we currently do. The difference is that now we know we're skipping data that wasn't of interest to the client, so it didn't result in eager client-side replication origin advances.
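
Here's the rough sketch of that startup check, continuing the illustrative names from the earlier snippet. In the real thing this would happen where the walsender validates the client's requested start point, and the error would be an ereport(ERROR, ...).

/* Hypothetical error helper, standing in for ereport(ERROR, ...). */
extern void report_lost_changes_error(const LogicalSlotProgress *slot,
                                      XLogRecPtr requested_lsn);

static void
check_requested_start_point(const LogicalSlotProgress *slot,
                            XLogRecPtr requested_lsn)
{
    if (requested_lsn < slot->last_write_lsn)
    {
        /*
         * The client wants to start before data it already received and
         * confirmed, so it has lost local state (lying fsync, restore from
         * an old snapshot, ...).  Refuse instead of silently fast-forwarding
         * to confirmed_flush_lsn.
         */
        report_lost_changes_error(slot, requested_lsn);
        return;
    }

    /*
     * requested_lsn >= last_write_lsn: anything between it and
     * confirmed_flush_lsn was filtered out or uninteresting and was never
     * sent, so skipping forward to confirmed_flush_lsn stays safe.
     */
}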


Think of last_write_lsn as "the value of confirmed_flush_lsn the last time the client actually flushed something interesting". We can safely skip from any value >= last_write_lsn to the current slot confirmed_flush_lsn if asked to start replay at any LSN in that range. We CANNOT safely skip from a position < last_write_lsn to confirmed_flush_lsn, since we know the client would miss data it already received and confirmed but has apparently forgotten due to a lying fsync(), a restore from a snapshot backup, etc.
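
To put illustrative numbers on that: say last_write_lsn is 0/5000000 and confirmed_flush_lsn is 0/9000000. A client asking to start at 0/6000000 can be fast-forwarded to 0/9000000 just as today, because nothing in that range was ever actually sent to it. A client asking to start at 0/4000000 gets an ERROR, because it already received and confirmed writes somewhere in (0/4000000, 0/5000000] and would otherwise silently lose them.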

We'd need more flushes on the upstream only if we were going to try to guarantee that we detect all lost changes from a client, since last_write_lsn would then need flushing in response to every client feedback message during apply (but not while idle). Even then the client could have flushed more changes we haven't got feedback for yet, so it's not really possible to totally prevent the problem. I don't think total prevention is that interesting, though. A window since the last slot checkpoint where we don't detect problems *if* the server has also crashed and restarted isn't too bad, and it's a lot better than the current situation.
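
For what the lazy path could look like, again with names that are mine and only sketch the idea of dirtying the slot rather than forcing a flush:

/* Hypothetical; think ReplicationSlotMarkDirty() in the real code. */
extern void mark_slot_dirty(LogicalSlotProgress *slot);

/* Called when the output plugin actually writes a change to the client. */
static void
note_plugin_write(LogicalSlotProgress *slot, XLogRecPtr write_end_lsn)
{
    if (write_end_lsn > slot->candidate_last_write_lsn)
        slot->candidate_last_write_lsn = write_end_lsn;
}

/* Called when a client feedback message reports its flush position. */
static void
note_client_flush(LogicalSlotProgress *slot, XLogRecPtr client_flush_lsn)
{
    if (slot->candidate_last_write_lsn != 0 &&
        client_flush_lsn >= slot->candidate_last_write_lsn)
    {
        /*
         * The client has durably flushed everything we actually sent, so
         * promote the candidate.  Only dirty the slot; the next slot
         * checkpoint persists it, so an otherwise idle connection never
         * forces extra fsyncs.
         */
        slot->last_write_lsn = slot->candidate_last_write_lsn;
        slot->candidate_last_write_lsn = 0;
        mark_slot_dirty(slot);
    }
}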

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
