Detecting skipped data from logical slots (data silently skipped) - Mailing list pgsql-hackers

From Craig Ringer
Subject Detecting skipped data from logical slots (data silently skipped)
Date
Msg-id CAMsr+YF7RmCgbaTaLgCHhdaA89=p9r3UegGFaVdJA1GBM-gB1Q@mail.gmail.com
Responses Re: Detecting skipped data from logical slots (data silently skipped)  (Greg Stark <stark@mit.edu>)
Re: Detecting skipped data from logical slots (data silently skipped)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi all

I think we have a bit of a problem with the behaviour specified for logical slots, one that makes it impossible for an outdated snapshot or backup of a logical-slot-using downstream to know that it's missing a chunk of data that's already been consumed from its slot. That's not great, since slots are supposed to ensure a continuous, gapless data stream.

If the downstream requests that logical decoding restarts at an LSN older than the slot's confirmed_flush_lsn, we silently ignore the client's request and start replay at the confirmed_flush_lsn. That's by design and fine normally, since we know the gap LSNs contained no transactions of interest to the downstream.
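The clamping behaviour described above can be sketched as a simplified model (a hypothetical illustration in Python, not the actual walsender code):

```python
# Simplified model of how the upstream picks the replay start point.
# Hypothetical sketch: the real logic lives in the walsender, and LSNs
# are modelled here as plain integers.
def replay_start_lsn(requested_lsn: int, confirmed_flush_lsn: int) -> int:
    # A request older than confirmed_flush_lsn is silently clamped
    # forward: decoding resumes at confirmed_flush_lsn and the client
    # is never told a gap existed.
    return max(requested_lsn, confirmed_flush_lsn)

# A downstream restored from backup asking for an old LSN is silently
# fast-forwarded past data some other client already consumed:
assert replay_start_lsn(0x1000, 0x5000) == 0x5000
```

Normally this is harmless, because the skipped range is known to contain nothing of interest to the downstream; the problem cases below arise when that assumption is violated.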

But it's *bad* if the downstream is actually a copy of the original downstream that's been started from a restored backup/snapshot. In that case the downstream won't know that some other client, probably a newer instance of itself, consumed rows it should've seen. It'll merrily continue replaying with no idea that it's no longer consistent.

The cause is an optimisation intended to let the downstream avoid local writes and flushes when the upstream's activity isn't of interest to it and doesn't result in replicated rows. When the upstream does a bunch of writes to another database, or otherwise produces WAL not of interest to the downstream, we send the downstream keepalive messages that include the upstream's current xlog position, and the client replies to acknowledge that it's seen the new LSN. But, so that we can avoid disk flushes on the downstream, we permit it to skip advancing its replication origin in response to those keepalives. We continue to advance confirmed_flush_lsn and restart_lsn in the replication slot on the upstream, so we can free WAL that's no longer needed and move the catalog_xmin up. The result is that the replication origin on the downstream falls behind the confirmed_flush_lsn on the upstream.
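The downstream's side of this optimisation can be sketched roughly like so (a hypothetical Python model, assuming an in-memory vs. on-disk origin position; it is not the pglogical/BDR apply code):

```python
# Hypothetical sketch of a downstream that acknowledges keepalives
# without durably flushing its replication origin, so the on-disk
# origin can lag the upstream's confirmed_flush_lsn.
class Downstream:
    def __init__(self):
        self.origin_lsn_mem = 0    # in-memory replay position
        self.origin_lsn_disk = 0   # durably flushed position

    def on_keepalive(self, upstream_lsn: int) -> int:
        # Advance in memory and reply, but skip the disk flush.
        self.origin_lsn_mem = max(self.origin_lsn_mem, upstream_lsn)
        return self.origin_lsn_mem  # reply acknowledges this LSN

    def flush(self):
        # Only happens occasionally, e.g. when applied xacts commit.
        self.origin_lsn_disk = self.origin_lsn_mem

d = Downstream()
d.on_keepalive(0x5000)
# A crash here loses the in-memory position: on restart the origin
# reports 0, while the upstream slot has already advanced to 0x5000.
assert d.origin_lsn_disk == 0
```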

This means that if the downstream exits/crashes before receiving some new row, its replication origin will tell it that it last replayed some LSN older than what it really did, and older than what the server retains. Logical decoding doesn't allow the client to "jump backwards" and replay anything older than the confirmed_flush_lsn. Since we "know" the gap cannot contain rows of interest (otherwise we'd have updated the replication origin), we just skip it and start replay at the confirmed_flush_lsn.

That means that if the downstream is restored from a backup it has no way of knowing it can't rejoin and become consistent because it can't tell the difference between "everything's fine, replication origin intentionally behind confirmed_flush_lsn due to activity not of interest" and "we've missed data consumed from this slot by some other peer and should refuse to continue replay".

The simplest fix would be to require downstreams to flush their replication origin when they get a keepalive message, before they send a reply confirming it. That could be somewhat painful for performance, but can be alleviated by waiting for the downstream postgres to get around to doing a flush anyway and only forcing one if we're getting close to the walsender timeout. That's pretty much what BDR and pglogical do when applying transactions, to avoid a disk flush for each and every applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if you ask for a too-old LSN rather than silently ignoring the request.
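Both halves of the proposal can be sketched in a few lines (hypothetical Python, with a made-up WALSENDER_TIMEOUT value; this is an illustration of the idea, not a patch):

```python
WALSENDER_TIMEOUT = 60.0  # seconds; hypothetical value for illustration

def start_replication(requested_lsn: int, confirmed_flush_lsn: int) -> int:
    # Proposed change: refuse a too-old start point instead of
    # silently skipping forward to confirmed_flush_lsn.
    if requested_lsn < confirmed_flush_lsn:
        raise RuntimeError(
            "requested LSN %#x is older than confirmed_flush_lsn %#x; "
            "data has already been consumed from this slot"
            % (requested_lsn, confirmed_flush_lsn))
    return requested_lsn

def must_force_origin_flush(last_flush_time: float, now: float,
                            origin_dirty: bool) -> bool:
    # Defer the origin flush so a natural flush can cover it; force
    # one only as the walsender timeout approaches.
    return origin_dirty and (now - last_flush_time) > WALSENDER_TIMEOUT / 2
```

With this, a downstream restored from a stale backup would get a hard error on reconnect instead of silently resuming past missed data, while a healthy downstream still batches most of its origin flushes.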

This problem can also bite you if you restore a copy of a downstream (say, to look at since-deleted data) while the original happens to be disconnected for some reason. The copy connects to the upstream and consumes some data from the slot. Then when the original comes back on line it has no idea there's a gap in its time stream.

This came up when investigating issues with people using snapshot-based BDR and pglogical backup/restore. It's a real-world problem that can result in silent data inconsistency.

Thoughts on the proposed fix? Any ideas for lower-impact fixes that'd still allow a downstream to find out if it's missed data?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
