Heap WARM Tuples - Design Draft - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Heap WARM Tuples - Design Draft
Date
Msg-id CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd+buUkJ4iDPw@mail.gmail.com
Whole thread Raw
Responses Re: Heap WARM Tuples - Design Draft  (Andres Freund <andres@anarazel.de>)
Re: Heap WARM Tuples - Design Draft  (Bruce Momjian <bruce@momjian.us>)
Re: Heap WARM Tuples - Design Draft  (Claudio Freire <klaussfreire@gmail.com>)
Re: Heap WARM Tuples - Design Draft  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
Write Amplification Reduction Method (WARM)
====================================

A few years back, we developed HOT to address the problem associated with MVCC and frequent updates and it has served us very well. But in the light of Uber's recent technical blog highlighting some of the problems that are still remaining, especially for workloads where HOT does not help to the full extent, Simon, myself and others at 2ndQuadrant discussed several ideas and what I present below is an outcome of that. This is not to take away credit from anybody else. Others may have thought about similar ideas, but I haven’t seen any concrete proposal so far.

At present, WARM looks to be a better idea than LITE, though we have that as a backstop also (says Simon).

Background
==========

As a background, HOT primarily helps in two ways.

1. It avoids additional index entries when a tuple is updated in a way such that no index columns are changed. Before we implemented HOT, every update would generate a new index entry in every index on the table. This not only resulted in bigger and bloater indexes, but also made vacuuming difficult because heap tuples can't be removed without first removing the corresponding index entries.  HOT made it possible to avoid additional index entries whenever possible and ensured most dead heap tuples can be removed at a page level.

2. HOT also helped other cases such as DELETE and non-HOT updates by reducing such heap tuples to a DEAD line pointer stub, until a regular vacuum happens. But the space consumed by the heap tuple is freed and returned to the FSM for future utilization.

Given that HOT was going to change the very core of MVCC and how tuples are stored and accessed, we made a decision to keep things simple and two important conditions were spelled out for doing HOT update.

1. There must be enough space in the same page to store the new version of the tuple.
2. None of the index columns must be updated.

While it was generally easy to satisfy the first condition by either leaving some free space in heap pages or system would anyways come to a stable state after initial few updates. The second condition is somewhat harder to control. Those who are aware would carefully design their schema and choose indexes so that HOT conditions are met. But that may not be possible for every workload, as seen from the Uber example. But many others may have faced similar situations.

This proposal is to relax the second condition, so it’s not quite HOT, just WARM.

Basic Idea
========

The basic idea is to allow updates that change an index column, without requiring every index on the table to cause a new index entry. We still require the first HOT condition i.e. the heap page must have sufficient free space to store the new version of the tuple.

Indexes whose values do not change do not require new index pointers. Only the index whose key is being changed will need a new index entry. The new index entry will be set to the CTID of the root line pointer.

This method succeeds in reducing the write amplification, but causes other issues which also need to be solved. WARM breaks the invariant that all tuples in a HOT chain have the same index values and so an IndexScan would need to re-check the index scan conditions against the visible tuple returned from heap_hot_search(). We must still check visibility, so we only need to re-check scan conditions on that one tuple version.
  
We don’t want to force a recheck for every index fetch because that will slow everything down. So we would like a simple and efficient way of knowing about the existence of a WARM tuple in a chain and do a recheck in only those cases, ideally O(1). Having a HOT chain contain a WARM tuple is discussed below as being a “WARM chain”, implying it needs re-check.

We could solve this by either 1) marking both old and new index tuples as "recheck-required" or 2) marking the HOT chain on the heap as "recheck-required" such that any tuple returned from the chain requires a recheck. 

The advantage of doing 1) is that the recheck is required only for the compromised index.
Note that in presence of multiple index keys pointing to the same root line pointer, the chain needs re-check for both the old as well as the new index key. The old index key must not fetch the new tuple(s) and the new index key must not fetch the old tuple(s). So irrespective of whether an old or a new tuple is being returned, the caller must know about the WARM tuples in the chains. So marking the index would require us to mark both old and new index tuples as requiring re-check. That requires an additional index scan to locate the old row and then an additional write to force it to re-check, which is algorithmically O(NlogN) on table size.

Marking the heap, (2) is O(1) per updated index, so seems better. But since we don't know with respect to which index or indexes the chain is broken, all index scans must do a recheck for a tuple returned from a WARM chain.

How do we mark a chain as WARM? How do we clear the flag? There are a few possible approaches:

1. Mark the first WARM tuple which causes HOT chain to become WARM with a special HEAP_RECHECK_REQUIRED flag (this flag must then be carried over to any further additions to the chain. Then whenever a tuple if returned from a chain, we must look forward into the chain for WARM tuple and let the caller know about potential breakage. This seems simple but would require us to follow HOT chain every time to check if it’s HOT or WARM.

2. Mark the root line pointer (or the root tuple) with a special HEAP_RECHECK_REQUIRED flag to tell us about the presence of a WARM tuple in the chain. Since all indexes point to the root line pointer, it should be enough to just mark the root line pointer (or tuple) with this flag. 

3. We could also adopt a hybrid approach and mark the new index tuple and new heap tuple with HEAP_RECHECK_REQUIRED flag and presence of either of the flag in either tuple forces recheck.

Approach 2 seems more reasonable and simple. 

There are only 2 bits for lp_flags and all combinations are already used. But for LP_REDIRECT root line pointer, we could use the lp_len field to store this special flag, which is not used for LP_REDIRECT line pointers. So we are able to mark the root line pointer.

Let me explain the idea in more detail using an example and technique 2)

CREATE TABLE test (col1 int, col2 int, col3 int, col4 text);
CREATE INDEX testindx_col1 ON test (col1);
CREATE INDEX testindx_col2 ON test (col2);
CREATE INDEX testindx_col3 ON test (col3);

INSERT INTO test VALUES (1, 11, 111, 'foo');

Let’s assume a single heap tuple and three indexes on three different columns. Lets assume that the heap tuple has CTID (0,1) at this point. At this point, each index has exactly one index entry pointing to CTID (0,1)

testindx_col1: (1)    ===> (0,1)
testindx_col2: (11)  ===> (0,1)
testindx_col3: (111) ===> (0,1)

Now if a transaction T1 UPDATEs the table such as, "UPDATE test SET col1 = 2". This does not satisfy the HOT property since index column 'col1' is being updated. But as per WARM algorithm, since only testindx_col1's index key has changed, we insert a new index tuple only in testindx_col1. Say the new heap tuple has CTID (0,2). So testindx_col1 will have a new index tuple with key (2), but still pointing to CTID (0,1)

testindx_col1: (1)    ===> (0,1)
                       (2)   ===> (0,1)
testindx_col2: (11)  ===> (0,1)
testindx_col3: (111) ===> (0,1)

The heap at this point has two tuples, linked by CTID chain. 
(0,1)* ---> (0,2)

Both these tuples are reachable via all indexes, starting at the root (0,1). A transaction which started before T1 should see (0,1) and those started after T1 should see (0,2). The MVCC visibility will ensure that the correct tuple is returned. But if a new transaction T2 is looking up (col1 = 1), while MVCC will return (0,2), the tuple does not satisfy the index condition. Similarly, for an old transaction T0 looking up (col1 = 2), MVCC will return (0,1) but that tuple does not satisfy the index condition either. IndexScan will ensure that tuples are rechecked in such instances.

If T3 later updates col2 and T4 further updates col3,
T3: UPDATE test SET col2 = 12;
T4: UPDATE test SET col3 = 112;

The heap will look like:
(0,1)* ---> (0,2) ---> (0,3) ---> (0,4)

And the index pointers will be:

testindx_col1: (1)   ===> (0,1)
                       (2)   ===> (0,1)
testindx_col2: (11)  ===> (0,1)
                       (12)  ===> (0,1)
testindx_col3: (111) ===> (0,1)
                       (112) ===> (0,1)

The (*) here indicates that there may exist a WARM tuple in the chain and index recheck is required. Our algorithm to recheck heap tuple when it’s returned from a WARM chain, should ensure that only valid and matching tuples are returned by the index scans.

Marking index tuples DEAD
=====================
When a WARM chain becomes DEAD and replaced by just a LP_DEAD stub, referencing index tuples can be marked as killed, and removed by VACUUM.

HOT Pruning
==========
No change is required for HOT pruning other than to copy the HEAP_CHECK_REQUIRED flag to the line pointer when root heap tuple is pruned and the root line pointer is turned into redirected line pointer. Except that, HOT prune can do its job i.e. remove dead tuples from the HOT or WARM chains and if necessary turn root line pointers into redirect or dead line pointers.

Clearing the flags
==============
We can't clear the flag until stale index pointers go, even if the chain is all HOT again. In other words, the flag must remain set until there exists a wrong index key pointing to the root tuple.

One idea is to somehow do this as part of the vacuum where we collect root line pointers of  WARM chains during the first phase of heap scan, check the indexes for all such tuples (may be combined with index vacuum) and then clear the heap flags during the second pass, unless new tuples are added to the WARM chain. We can detect that by checking that all tuples in the WARM chain still have XID less than the OldestXmin that VACUUM is using.

We don’t really need to count all references to a root of WARM chain. What we need to know is if there exists an index which has more than one pointer to the root tuple. If yes, the flag cannot be cleared. ISTM that even two bits per htid will be enough to track this state. Other ideas welcome. 

WAL Logging
===========

It’s important to ensure that the flag is set when it is absolutely necessary, while having false positives is not a problem. We might do a little wasteful work if the flag is incorrectly set. Since flag will be set only during heap_update() and the operation is already WAL logged, this can be piggybacked with the heap_update WAL record. Similarly, when a root tuple is pruned to a redirect line pointer, the operation is already WAL logged and we can piggyback setting of line pointer flag with that WAL record.

Flag clearing need not be WAL logged, unless we can piggyback that to some existing WAL logging.

Vacuum
========

Vacuum should continue to do what it does today i.e. prune HOT chains and collect LP_DEAD line pointers and remove associated index entries and then mark line pointers as LP_UNUSED. It could be extended to do additional work of clearing the flags to avoid rechecks when they are not any more required as discussed above.

Testing
=====

It will be straightforward to have a developer-level index_recheck GUC parameter.

Index_recheck = off (default) | on

For testing, compare results between two queries, one with index_recheck=on and one with index_recheck=off. If this finds any difference we’ll know we have a bug.

We can use this in testing the feature in development and also use it in the field if we suspect problems.

Comments/suggestions?

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: Oddity in EXPLAIN for foreign/custom join pushdown plans
Next
From: Ildar Musin
Date:
Subject: Unused argument in PinBuffer function