Latest plans for Utilities with HOT - Mailing list pgsql-hackers

From Simon Riggs
Subject Latest plans for Utilities with HOT
Date
Msg-id 1173107690.3760.1912.camel@silverbirch.site
Whole thread Raw
Responses Re: Latest plans for Utilities with HOT
List pgsql-hackers
Overview
--------
CREATE INDEX, CREATE INDEX CONCURRENTLY and VACUUM FULL all need some
adaptation to work correctly with HOT.

[This summary and proposal supercedes all previous proposals by me
regarding utilities with HOT]

The Problem
-----------

With HOT, CREATE INDEX may find tuples that are HEAP_ONLY_TUPLE, yet
have an index value(s) that differs from the indexable value of the
tuple version which is at the root of the hot chain.

e.g. we have a simple hot chain that looks like this on one page:

tup1(attr1=1, attr2=A)[ROOT] -> tup2(attr1=1, attr2=B)[HOT)

there is already an index(attr1), and we are trying to build an index on
attr2. Index(attr2) will to point to both tup1 and tup2. So by the end
of the build we need to have inserted an additional index pointer into
index(attr1) (yes, the other index) for tup2 and also broken the link
between tup1->tup2. This operation is known as chilling, or
de-HOT-ifying.

If we don't do this, we will have an index inconsistency. 

If we attempt to chill a tuple concurrently, we also get the possibility
of a concurrent index scan following multiple paths to the same tuple.
This effects all types of indexes, not just b-trees.

Solutions
---------

There are three basic solutions: LOCK,  UPDATE & WAIT and MARK

1. LOCK - Use some form of locking to prevent concurrent index scans
while we modify the various tuple versions and fiddle with the index
entries. This has some sub-options:
a) AccessExclusiveLock on main relation prevents all index accesses
b) Lock all of the indexes with an AccessExclusiveLock, in a consistent
order, so that no concurrent index scans are possible. This would be a
non-transactional lock, so could be dropped before end of transaction -
we're not locking the data we're just locking the physical access
mechanism.

2. UPDATE & WAIT - From Hannu - Perform cold-null-UPDATEs on each tuple,
which can happen concurrently. This can happen concurrently, but then
requires us to wait until we are older than all non-utility
transactions, so that these UPDATEs are visible to all.

3. MARK - From Heikki - verbatim:
One solution I thought of is to add another flag to heap tuples, 
CHILL_IN_PROGRESS. To chill a tuple, you would:

1. Mark heap tuple with CHILL_IN_PROGRESS
2. Insert missing index entries
3. Clear CHILL_IN_PROGRESS and HEAP_ONLY_TUPLE flags

Index scans would ignore tuples with CHILL_IN_PROGRESS and directly 
pointed to from the index. That way if we crash in the middle of step 2,
scans and updates would work normally after replay, as if the index 
entries weren't there. CREATE INDEX would have to fail if there's any 
CHILL_IN_PROGRESS tuples, because we wouldn't know which index entries 
need to be inserted; some might already be there. To clear the 
CHILL_IN_PROGRESS flag, a vacuum would be needed. Vacuum would remove 
all index entries for those tuples, but instead of removing the heap 
tuple in the 2nd scan it would just clear the CHILL_IN_PROGRESS flag, 
bringing us back to where we started.

Thoughts
--------

- CREATE INDEX CONCURRENTLY can use UPDATE & WAIT or MARK

- VACUUM FULL could use any of the solutions. It does currently hold
AccessExclusiveLock. UPDATE & WAIT would be very good.

- CREATE INDEX has the most problems, but MARK solves most of them.

Multiple index builds may need to chill different sets of tuples, but if
there is any overlap then the second build will fail when it sees the
CHILL_IN_PROGRESS flag that the other has set. So we can no longer
achieve multiple concurrent index builds without great care.

It might be possible to clean up old CHILL_IN_PROGRESS flags as we go,
but if we did that we'd need to make sure concurrent index scans never
pass each other, otherwise one index build would clean up the others
flags. So we have two choices:
a) we continue to allow concurrent index builds, but we throw an ERROR
when an index build finds a CHILL_IN_PROGRESS flag
b) we clean up CHILL_IN_PROGRESS flags as we go, but we give up the
ability to perform concurrent index builds

Proposal
--------

- CREATE INDEX can use the MARK solution as part of the first scan (cos
there's only one), setting flags called HEAP_CHILL_IN_PROGRESS.

As the index build scan proceeds we insert records directly into each
index using the standard WAL-logged insertion path. No spool file is
built up while we do this. Setting HEAP_CHILL_IN_PROGRESS must must also
write WAL records and dirty the block, since we might fail after we have
performed an index insert but before we remove the mark. This could mean
we'd end up writing lots of database blocks when we do CREATE INDEX.
(Note here that we use WAL to chill, even though the main index build
may still skip writing WAL as it does now, if archive_command is not
set).

Note we might fail to build an index because of a uniqueness violation
that occurs *during* the tuplesort run. So we must mark the tuple,
insert the indexes, clean the tuple and then offer the tuple to the
index callback.

We pick option a): continue to allow concurrent index builds, but we
throw an ERROR when an index build finds a HEAP_CHILL_IN_PROGRESS flag.
That's the optimistic approach. Building indexes concurrently only
usually happens after a restore/large load, which will be when there are
no HOT tuples or flags set anyway.

This means we also need to modify the VACUUM process to clean up after a
failed index build. VACUUM will build a second array of tids that can be
used to perform removal of HEAP_CHILL_IN_PROGRESS during
lazy_scan_index() [[perhaps with uncool_tuple()? :-) ]]. We use a second
array because we need only remove HEAP_CHILL_IN_PROGRESS flags when a
tuple will not be removed by the normal vacuum process, so there is by
definition no overlap between the two lists. We use an array of tids to
allow us to reuse all the existing code for heap/index cleaning.

- CREATE INDEX CONCURRENTLY can use the MARK solution as part of the
second scan. After the first scan, the new index will be visible to
other transactions and no new HOT tuples will be created that could
effect the new index. So chilling as part of the second scan will ensure
that we catch all HOT tuples.

- VACUUM FULL - The best solution, for now, is to make VACUUM FULL
perform a reindex on all indexes on the table. Chilling may require us
to modify considerably more index entries than previously. UPDATE & WAIT
would be very good, but probably should wait for the next release now,
since we now have changes to make to 4 utilities.

Are we cool with that?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




pgsql-hackers by date:

Previous
From: NikhilS
Date:
Subject: Re: PrivateRefCount (for 8.3)
Next
From: Tom Lane
Date:
Subject: Re: Aggressive freezing in lazy-vacuum