Hello, hackers.
Currently, "killed tuple" hint bits in index pages are set and taken
into account only on the primary server. A standby simply ignores them.
There is a good reason for this, of course (see RelationGetIndexScan
and [1]):
* We do this because the xmin on the primary node could easily be
* later than the xmin on the standby node, so that what the primary
* thinks is killed is supposed to be visible on standby. So for correct
* MVCC for queries during recovery we must ignore these hints and check
* all tuples.
Still, according to [2] and cases like [3], it seems to be a good idea
to support "ignore_killed_tuples" on a standby.
I think I know a way to support it correctly with a reasonable amount
of changes.
The first thing to consider: checksums and wal_log_hints are widely
used these days. So at any moment the primary could send a full-page
write (FPW) containing new "killed tuples" hints and overwrite hints
set by the standby. Moreover, it is not possible to tell whether a
given hint was set by the primary or by the standby.
This is where hot_standby_feedback comes into play. The primary takes
the xmin of hot_standby_feedback replicas into account
(RecentGlobalXmin) when setting "killed tuples" bits. So, if
hot_standby_feedback has been enabled on a standby for a while, the
standby can safely trust hint bits coming from the primary.
Also, the standby could set its own hints using the xmin it sends to
the primary during feedback (but without marking the page dirty).
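
To make the horizon rule above concrete, here is a minimal sketch in C.
It is not PostgreSQL code: Xid, xid_precedes, oldest_xmin and
can_mark_killed are hypothetical names used only for illustration, the
deleting transaction is assumed to have committed, and special XID
values are ignored. Only the wraparound-aware comparison mirrors how
normal transaction IDs are compared.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a transaction ID (32-bit, wraps around). */
typedef uint32_t Xid;

/* Wraparound-aware test: is a logically older than b? */
bool xid_precedes(Xid a, Xid b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Horizon the primary would use: the oldest xmin among its own
 * snapshots and the xmins reported by hot_standby_feedback replicas.
 */
Xid oldest_xmin(Xid primary_xmin, const Xid *feedback_xmins, size_t n)
{
    Xid oldest = primary_xmin;

    for (size_t i = 0; i < n; i++)
    {
        if (xid_precedes(feedback_xmins[i], oldest))
            oldest = feedback_xmins[i];
    }
    return oldest;
}

/*
 * An index entry may be hinted as killed only if the (committed)
 * deleting transaction is older than the horizon, so that no snapshot
 * on the primary or on a feedback replica can still see the tuple.
 */
bool can_mark_killed(Xid deleter_xid, Xid horizon)
{
    return xid_precedes(deleter_xid, horizon);
}
```

With a primary xmin of 1000 and replicas reporting 950 and 980, the
horizon is 950: a tuple deleted by xid 900 may be hinted, one deleted
by xid 960 may not, because the replica at 950 could still see it.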
Of course, it is not that easy; there are a few corner cases to take
care of:
* It looks like RecentGlobalXmin could move backwards when a new
replica with a lower xmin connects (or when some replica is switched to
hot_standby_feedback=on). We must ensure RecentGlobalXmin only ever
moves forward.
* hot_standby_feedback could be enabled on the fly. In that case we
need to distinguish transactions that can safely use hints from those
that cannot. The standby could receive a fresh RecentGlobalXmin in
response to its feedback message. All standby transactions with
xmin >= RecentGlobalXmin are safe to use hints.
* hot_standby_feedback could be disabled on the fly. In that situation
the standby needs to keep sending feedback while it cancels all queries
running with ignore_killed_tuples=true. Once all such queries are
canceled, feedback is no longer needed and can be stopped.
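
The three corner cases above could be sketched roughly as follows.
Again, this is only an illustration under my own assumptions, not
existing PostgreSQL code: Xid, advance_horizon, snapshot_may_use_hints
and feedback_still_required are hypothetical names, and special XID
values are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a transaction ID (32-bit, wraps around). */
typedef uint32_t Xid;

/* Wraparound-aware test: is a logically older than b? */
bool xid_precedes(Xid a, Xid b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Case 1: the horizon must never move backwards. A candidate that is
 * older than the current value (e.g. from a newly connected replica)
 * is ignored.
 */
Xid advance_horizon(Xid current, Xid candidate)
{
    return xid_precedes(current, candidate) ? candidate : current;
}

/*
 * Case 2: after feedback is enabled on the fly, a standby snapshot may
 * use hints only if its xmin is not older than the horizon the primary
 * has confirmed, i.e. xmin >= RecentGlobalXmin.
 */
bool snapshot_may_use_hints(Xid snapshot_xmin, Xid confirmed_horizon)
{
    return !xid_precedes(snapshot_xmin, confirmed_horizon);
}

/*
 * Case 3: after feedback is disabled on the fly, the standby keeps
 * sending it until no running query still relies on hints.
 */
bool feedback_still_required(bool setting_enabled, int queries_using_hints)
{
    return setting_enabled || queries_using_hints > 0;
}
```

For example, advance_horizon(1000, 990) keeps the horizon at 1000, a
snapshot with xmin 999 must not use hints against a confirmed horizon
of 1000, and feedback continues while at least one hint-using query is
still running even though the setting is already off.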
Could someone validate my thoughts, please? If the idea is mostly
correct, I could try to implement and test it.
[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org