Re: Thoughts on "killed tuples" index hint bits support on standby - Mailing list pgsql-hackers

From Michail Nikolaev
Subject Re: Thoughts on "killed tuples" index hint bits support on standby
Date
Msg-id CANtu0ogCZveeVXmJWRO+_eWzrXX42z1LuykwgpZTCqv-5jErew@mail.gmail.com
Whole thread Raw
In response to Re: Thoughts on "killed tuples" index hint bits support on standby  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Thoughts on "killed tuples" index hint bits support on standby
List pgsql-hackers
Hello, Peter.

Thanks a lot for your comments.
There are some mine thoughts, related to the “masked bits” solution and your comments:

> During recovery, we will probably always have to consider the
> possibility that LP_DEAD bits that get set on the primary may be
> received by a replica through some implementation detail (e.g. LP_DEAD
> bits are set in FPIs we replay, maybe even some other thing that
> neither of us have thought of).

It is fine to receive a page to the standby from any source: `btpo_flags` should have some kind “LP_DEAD safe for standby” bit set to allow new bits to be set and old - read.

> We can't really mask LP_DEAD bits from
> the primary in recovery anyway, because of stuff like page-level
> checksums. I suspect that it just isn't useful to fight against that.

As far as I can see - there is no problem here. Checksums already differ for both heap and index pages on standby and primary. Checksums are calculated before the page is written to the disk (not after applying FPI). So, the masking page during *applying* the FPI is semantically the same as setting a bit in it 1 nanosecond after.

And `btree_mask` (and other mask functions) already used for consistency checks to exclude LP_DEAD.

> Plus you'll introduce new overhead for this process during replay,
> which creates significant overhead -- often most leaf pages have some
> LP_DEAD bits set during recovery.

I hope it is not big enough, because FPIs are not too frequent + positive effect will easily overcome additional CPU cycles of `btree_mask` (and the page is already in CPU cache at the moment).

> I don't like that idea. Apart from what I said already, you're
> assuming that setting LP_DEAD bits in indexes on the primary won't
> eventually have some value on the standby after it is promoted and can
> accept writes -- they really do have meaning and value on standbys.

I think it is fine to lose part of LP_DEAD on promotion (which can be received only by FPI in practice). They will be set on the first scan anyway. Also, bits set by standby may be used by newly promoted primary if we honor OldestXmin of the previous primary while setting it (just need to add OldestXmin in xl_running_xacts and include it into dead-horizon on standby).

> Why shouldn't this break page-level checksums (or wal_log_hints) in
> some way? What about pg_rewind, some eventual implementation of
> incremental backups, etc? I suspect that it will be necessary to
> invent some entirely new concept that is like a hint bit, but works on
> standbys (without breaking anything else).

As I said before - applying the mask on *standby* will not break any checksums - because the page is still dirty after that (and it is even possible to call `PageSetChecksumInplace` for an additional paranoia). Actual checksums on standby and primary already have different values (and, probably, in the most of the pages because LP_DEAD and “classic” hint bits).

> If you invent some entirely new category of standby-only hint bit at a
> level below the access method code, then you can use it inside access
> method code such as nbtree. Maybe you don't have to play games with
> minRecoveryPoint in code like the "if (RecoveryInProgress())" path
> from the XLogNeedsFlush() function. Maybe you can do some kind of
> rudimentary "masking" for the in recovery case at the point that we
> *write out* a buffer (*not* at the point hint bits are initially set)
> -- maybe this could happen near to or inside FlushBuffer(), and maybe
> only when checksums are enabled? I'm unsure.

Not sure I was able to understand your idea here, sorry.

> The difference may be subtle, but it's important here -- it justifies
> inventing a whole new type of LP_DEAD-style status bit that gets set
> only on standbys. Even today, heap tuples can have hint bits
> "independently" set on standbys, subject to certain limitations needed
> to avoid breaking things like data page checksums

Yes, and I see three major ways to implement it in the current infrastructure:

1) Use LP_REDIRECT (or other free value) instead of LP_DEAD on standby
2) Use LP_DEAD on standby, but involve some kind of recovery conflicts (like here - https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com )
3) Mask index FPI during a replay on hot standby + mark page as “primary LP_DEAD free” in btpo_flags

Of course, each variant requires some special additional things to keep everything safe.

As I see in SetHintsBits limitations are related to XLogNeedsFlush (check of minRecoveryPoint in case of standby).

> Conclusion: The whole minRecoveryPoint question that you're trying to
> answer to improve things for your patch is just the wrong question.
> Because LP_DEAD bits in indexes are not *true* "hint bits". Maybe it
> would be useful to set "true hint bits" on standbys earlier, and maybe
> thinking about minRecoveryPoint would help with that problem, but that
> doesn't help your index-related patch -- because indexes simply don't
> have true hint bits.

Attention to minRecoveryPoint is required because of possible situation described here - https://www.postgresql.org/message-id/flat/CANtu0ojwFcSQpyCxrGxJuLVTnOBSSrzKuF3cB_yCk0U-X-wpGw%40mail.gmail.com#4d8ef8754b18c5e35146ed589b25bf27
The source of the potential problem - is the fact what the setting of LP_DEAD does not change page LSN and it could cause consistency issues during crash recovery.

Thanks,
Michail.

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] Custom compression methods
Next
From: David Rowley
Date:
Subject: Re: [sqlsmith] Failed assertion during partition pruning