Re: Thoughts on "killed tuples" index hint bits support on standby - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Thoughts on "killed tuples" index hint bits support on standby
Msg-id CAH2-Wzm1aY190kD3Z=w3y8V3RnOyi16uF1Xik9q-BuemOo6yFA@mail.gmail.com
In response to Re: Thoughts on "killed tuples" index hint bits support on standby  (Andres Freund <andres@anarazel.de>)
Responses Re: Thoughts on "killed tuples" index hint bits support on standby  (Michail Nikolaev <michail.nikolaev@gmail.com>)
List pgsql-hackers
On Thu, Jan 16, 2020 at 9:54 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think we can rely on hot_standby_feedback at all. We can use it
> to avoid unnecessary cancellations, etc., and even assume it's set up
> reasonably for some configurations, but there always needs to be an
> independent correctness backstop.

+1

> I'm less clear on how we can make sure that we can *rely* on LP_DEAD to
> skip over entries during scans, however. The bits set as described above
> would be safe, but we also can see LP_DEAD set by the primary (and even
> upstream cascading standbys at least in case of new base backups taken
> from them), due to them not being WAL logged. As we don't WAL log,
> there is no conflict associated with the LP_DEADs being set.  My gut
> feeling is that it's going to be very hard to get around this, without
> adding WAL logging for _bt_killitems et al (including an interface for
> kill_prior_tuple to report the used horizon to the index).

I agree.

What about calling _bt_vacuum_one_page() more often than strictly
necessary to avoid a page split on the primary? The B-Tree
deduplication patch sometimes does that, albeit for completely
unrelated reasons. (We don't want to have to unset an LP_DEAD bit in
the case when a new/incoming duplicate tuple has a TID that overlaps
with the posting list range of some existing duplicate posting list
tuple.)

I have no idea how you'd determine that it was time to call
_bt_vacuum_one_page(). Seems worth considering.

> I'm wondering if we could recycle BTPageOpaqueData.xact to store the
> horizon applying to killed tuples on the page. We don't need to store
> the level for leaf pages, because we have BTP_LEAF, so we could make
> space for that (potentially signalled by a new BTP flag).  Obviously we
> have to be careful with storing xids in the index, due to potential
> wraparound danger - but I think such a page would have to be vacuumed
> anyway, before a potential wraparound.

You would think that, but unfortunately we don't currently do it that
way. We store XIDs in deleted leaf pages that can sometimes be missed
until the next wraparound.

We need to do something like commit
6655a7299d835dea9e8e0ba69cc5284611b96f29, but for B-Tree. It's
somewhere on my TODO list.
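
For concreteness, here is a rough sketch of the layout change proposed in the quoted paragraph: on leaf pages the level is implied by BTP_LEAF, so the same slot could hold a killed-tuples horizon, signalled by a new flag. This is illustrative only. SketchPageOpaque, BTP_HAS_KILL_XID and the sketch_* functions are hypothetical names, the struct is a simplified stand-in rather than the real BTPageOpaqueData, and the comparison ignores the special (non-normal) XIDs that TransactionIdPrecedes() handles.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define BTP_LEAF          (1 << 0)   /* leaf page flag, as in nbtree */
#define BTP_HAS_KILL_XID  (1 << 15)  /* HYPOTHETICAL flag: btpo.xact holds a horizon */

/* Simplified stand-in for BTPageOpaqueData (assumption, not the real layout). */
typedef struct SketchPageOpaque
{
    union
    {
        uint32_t      level;  /* tree level; only needed on internal pages */
        TransactionId xact;   /* reused on leaf pages: killed-tuples horizon */
    } btpo;
    uint16_t btpo_flags;
} SketchPageOpaque;

/* Circular XID comparison (modulo 2^32), ignoring special XIDs. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Remember the newest XID whose removal justified an LP_DEAD bit on this page. */
static void
sketch_note_killed(SketchPageOpaque *opaque, TransactionId xid)
{
    if (!(opaque->btpo_flags & BTP_LEAF))
        return;                 /* internal pages still need btpo.level */

    if (!(opaque->btpo_flags & BTP_HAS_KILL_XID) ||
        xid_precedes(opaque->btpo.xact, xid))
    {
        opaque->btpo.xact = xid;
        opaque->btpo_flags |= BTP_HAS_KILL_XID;
    }
}

/*
 * Called once cleanup has removed every LP_DEAD item from the page.
 * Returns the horizon that would go into the WAL record (0 if none),
 * and resets the slot so no stale XID is left behind in the index.
 */
static TransactionId
sketch_clear_killed(SketchPageOpaque *opaque)
{
    TransactionId horizon = 0;

    if (opaque->btpo_flags & BTP_HAS_KILL_XID)
        horizon = opaque->btpo.xact;

    opaque->btpo_flags &= (uint16_t) ~BTP_HAS_KILL_XID;
    opaque->btpo.level = 0;     /* leaf pages are level 0 */
    return horizon;
}
```

The wraparound concern is about pages where the clearing step never runs: the stored XID stays on disk and eventually becomes meaningless under the circular comparison.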

> I think we could safely unset
> the xid during nbtree single page cleanup, and vacuum, by making sure no
> LP_DEAD entries survive, and by including the horizon in the generated
> WAL record.
>
> That still doesn't really fully allow us to set LP_DEAD on
> standbys, however - but it'd allow us to take the primary's LP_DEADs
> into account on a standby. I think we'd run into torn page issues, if we
> were to do so without WAL logging, because we'd rely on the LP_DEAD bits
> and BTPageOpaqueData.xact to be in sync.  I *think* we might be safe to
> do so *iff* the page's LSN indicates that there has been a WAL record
> covering it since the last redo location.

That sounds like a huge mess.
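
To make the condition in the quoted paragraph concrete, a minimal sketch of the check might look like the following. It assumes the standby has the page LSN and the last redo location to hand, and that a page-level horizon marker like the one proposed above exists. sketch_lp_dead_trustworthy and its parameters are hypothetical and not part of PostgreSQL.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL position, as a plain 64-bit integer */

/*
 * HYPOTHETICAL helper: may a standby honour the LP_DEAD bits found on a page?
 *
 * Sketch of the rule from the quoted text: only if some WAL record has
 * covered the page since the last redo location. Presumably the reasoning
 * is that the first such record carries a full-page image (with
 * full_page_writes on), so the LP_DEAD bits and the stored horizon
 * arrived together and cannot be torn.
 */
static bool
sketch_lp_dead_trustworthy(XLogRecPtr page_lsn,
                           XLogRecPtr last_redo_lsn,
                           bool page_has_kill_horizon)
{
    if (!page_has_kill_horizon)
        return false;           /* no horizon stored: nothing to check against */

    return page_lsn > last_redo_lsn;
}
```

Even in this reduced form, correctness hinges on two separately maintained pieces of page state agreeing with each other, which is the fragile part.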

--
Peter Geoghegan


