Re: Thoughts on "killed tuples" index hint bits support on standby - Mailing list pgsql-hackers

From Michail Nikolaev
Subject Re: Thoughts on "killed tuples" index hint bits support on standby
Msg-id CANtu0ojmkN_6P7CQWsZ=uEgeFnSmpCiqCxyYaHnhYpTZHj7Ubw@mail.gmail.com
In response to Re: Thoughts on "killed tuples" index hint bits support on standby  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Thoughts on "killed tuples" index hint bits support on standby
List pgsql-hackers

Hello, Peter.

Thanks for your feedback.

> Attached is a very rough POC patch of my own, which makes item
> deletion occur "non-opportunistically" in unique indexes. The idea is
> that we exploit the uniqueness property of unique indexes to identify
> "version churn" from non-HOT updates. If any single value on a leaf
> page has several duplicates, then there is a good chance that we can
> safely delete some of them. It's worth going to the heap to check
> whether that's safe selectively, at the point where we'd usually have
> to split the page. We only have to free one or two items to avoid
> splitting the page. If we can avoid splitting the page immediately, we
> may well avoid splitting it indefinitely, or forever.

Yes, it is a brilliant idea to use uniqueness to avoid bloating the index. I am
not able to understand all the code now, but I’ll check the patch in more
detail later.

> This seems fairly relevant to what you're doing. It makes almost all
> index cleanup occur using the new delete infrastructure for some of
> the most interesting workloads where deletion takes place in client
> backends. In practice, a standby will almost be in the same position
> as the primary in a workload that this approach really helps with,
> since setting the LP_DEAD bit itself doesn't really need to happen (we
> can go straight to deleting the items in the new deletion path).

> This is probably
> not limited to the special unique index case that my patch focuses on
> -- we can probably push this general approach forward in a number of
> different ways. I just started with unique indexes because that seemed
> most promising. I have only worked on the project for a few days. I
> don't really know how it will evolve.

Yes, it is relevant, but I think the two approaches lie «in different planes»
and complement each other. Most indexes are not unique these days, and most
standbys (and even primaries) hold long-lived snapshots (minutes or even
hours), so multiple versions of index records are still required for them. Even
if we could somehow avoid keeping multiple versions, that could lead to a very
high rate of query cancellations on the standby.

> To address the questions you've asked: I don't really like the idea of
> introducing new rules around tuple visibility and WAL logging to set
> more LP_DEAD bits like this at all. It seems very complicated.

I don’t think it is too complicated. I have polished the idea a little, and now
it even looks elegant to me :) I’ll try to explain the concept briefly (there
are no new visibility rules, and no more LP_DEAD bits are set than now;
everything is based on well-tested mechanics):

1) There is already a kind of xmin horizon that the primary pushes to a
standby. All of the standby’s snapshots are required to satisfy this horizon to
access the heap and indexes. This is done by
*ResolveRecoveryConflictWithSnapshot* and the corresponding WAL records (for
example, XLOG_BTREE_DELETE).

2) We could introduce a new WAL record (named XLOG_INDEX_HINT in the patch for
now) to define the xmin horizon that a standby’s snapshot must satisfy in order
to use LP_DEAD bits in the indexes.

3) The primary sends XLOG_INDEX_HINT when it sets an LP_DEAD bit on an index
page (but before any FPW caused by hints) by calling *LogIndexHintIfNeeded*.
Such a record needs to be sent only if the new xmin value is greater than the
one sent before. I ran tests to estimate the amount of new WAL: it is really
small (especially compared to the FPW writes caused by setting LP_DEAD bits).

4) The new XLOG_INDEX_HINT record contains only a database id and the value of
*latestIndexHintXid* (the new horizon position). For simplicity, the primary
could just set the horizon to *RecentGlobalXmin*. For now, though, the patch
extracts the horizon value from the heap in *HeapTupleIsSurelyDead* to reduce
the number of XLOG_INDEX_HINT records even more.


5) There is a new field in the PGPROC structure, *indexIgnoreKilledTuples*. If
it is set to true, standby queries use LP_DEAD bits in index scans. In that
case the snapshot is required to satisfy the new LP_DEAD horizon pushed by
XLOG_INDEX_HINT records. This is done by the same mechanism as used for the
heap, *ResolveRecoveryConflictWithSnapshot*.

6) The major point here: it is safe to set *indexIgnoreKilledTuples* to either
‘true’ or ‘false’ from the perspective of correctness. It is just a performance
trade-off: either use LP_DEAD bits and be subject to the XLOG_INDEX_HINT
horizon, or ignore both.

7) How do we make the right decision about this trade-off? It is pretty simple:
if hot_standby_feedback is on and the primary has confirmed that our feedback
was received, then set *indexIgnoreKilledTuples* to ‘true’, since while
feedback is working as expected the query will never be canceled by the
XLOG_INDEX_HINT horizon!

8) To support cascading standby setups (with a possible break in the feedback
chain), an additional byte is added to the ‘keep-alive’ message of the feedback
protocol.

9) So, at this point it is safe to use LP_DEAD bits received from the primary
whenever we want to.

10) What about setting LP_DEAD bits on the standby? The main point here:
*RecentGlobalXmin* on the standby is always lower than the XLOG_INDEX_HINT
horizon by definition, because the standby is always behind the primary. So, if
something looks dead on the standby, it is definitely dead on the primary.

11) Even if:

* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery
nothing can go wrong here.

This is because *HeapTupleIsSurelyDead* (and, as a result, index LP_DEAD) needs
the *HEAP* hint bits to be already set on the standby. So the same code decides
whether to set hint bits in the heap (which has been done on standby for a long
time) and in the index.

So, the only price we pay is a few additional bytes of WAL and some moderate
additional code complexity. But support for hint bits on standby is a huge
advantage for many workloads. I was able to get more than a 1000% performance
boost (which is not surprising: index hint bits are just a great optimization).
And it works for almost all index types out of the box.

Another major thing here – everything is based on old, well-tested mechanics:
query cancelation because of snapshot conflicts and setting heap hint bits on
standby.

Most of the patch consists of technical changes to support new query
cancellation counters, a new WAL record, a new PGPROC field and so on. There
are some places I am not sure about yet, and the naming is bad; it is still a
POC.

Thanks,
Michail.
