Preliminary notes about hash index concurrency (long)

From: Tom Lane
I've been looking at fixing the problem reported a few days ago whereby
a bucket split in a hash index messes up the state of concurrent scans
of the index, possibly causing some tuples to be missed by the scans.
AFAICS the only way to fix this is to prevent such a concurrent split.
Accordingly, I've been trying to redesign the hash index locking
mechanisms to make that possible, and while I'm at it eliminate the
various internal deadlock risks that presently exist in hash indexes.
Attached are some design notes --- any comments?
        regards, tom lane


Lock definitions
----------------

We use both lmgr locks ("heavyweight" locks) and buffer context locks
(LWLocks) to control access to a hash index.  lmgr locks are needed for
long-term locking since there is a (small) risk of deadlock, which we must
be able to detect.  Buffer context locks are used for short-term access
control to individual pages of the index.

We define the following lmgr locks for a hash index:

LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
mapping.  A process attempting to enlarge the hash table by splitting a
bucket must exclusive-lock this lock before modifying the metapage data
representing the mapping.  Processes intending to access a particular
bucket must share-lock this lock until they have acquired lock on the
correct target bucket.

LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket.  A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done.  A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from.  Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements.

Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share.  A process must also obtain
read or write buffer lock on the metapage or bucket page before accessing
said page.

Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan.  This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it.  This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process.  If that process
then does something that requires exclusive lock on the bucket, we have
deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from.  This also forces the page-zero lock
to be an lmgr lock, because as we'll see below it is held while attempting
to acquire a bucket lock, and so it could also participate in a deadlock.

Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock
be held at a time by any one process.  (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)
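
For concreteness, here is a minimal C sketch of the intended ordering.
LockPage/UnlockPage and the buffer-manager calls below are the existing
lmgr/bufmgr primitives, but the sketch is illustrative only and is not
taken from the hash AM code:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/*
 * Illustrative only: take the long-term lmgr lock before touching any
 * buffer, never hold a buffer context lock while waiting on an lmgr
 * lock, and hold at most one buffer lock at a time.
 */
static void
example_examine_bucket_page(Relation rel, BlockNumber blkno)
{
    Buffer      buf;

    /* lmgr lock first: this one may block, and deadlock is detectable */
    LockPage(rel, blkno, ShareLock);

    /* short-term buffer context lock, taken only after the lmgr lock */
    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);

    /* ... inspect BufferGetPage(buf) here ... */

    /* drop the buffer lock before doing anything that could wait */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

    /* the lmgr lock may be held long-term (e.g. across an index scan) */
    UnlockPage(rel, blkno, ShareLock);
}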


Pseudocode algorithms
---------------------

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at the conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation.  Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks.  Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:
    share-lock page 0 (to prevent active split)
    read/share-lock meta page
    compute bucket number for target hash key
    release meta page
    share-lock bucket page (to prevent split/compact of this bucket)
    release page 0 share-lock
-- then, per read request:
    read/share-lock current page of bucket
        step to next page if necessary (no chaining of locks)
    get tuple
    release current page
-- at scan shutdown:
    release bucket share-lock

By holding the page-zero lock until lock on the target bucket is obtained,
the reader ensures that the target bucket calculation is valid (otherwise
the bucket might be split before the reader arrives at it, and the target
entries might go into the new bucket).  Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by other processes.  Notice though that the reader need
not prevent other buckets from being split or compacted.
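
For concreteness, the scan-startup portion might look roughly like the
sketch below.  HASH_METAPAGE and the LockPage/LockBuffer calls mirror
existing backend interfaces, but the exact usage is an assumption, and
hash_key_to_bucket_blkno() is a hypothetical helper standing in for the
metapage arithmetic:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* hypothetical helper: map a hash value to its bucket's block number
 * using the mapping stored in the metapage */
extern BlockNumber hash_key_to_bucket_blkno(Page metapage, uint32 hashkey);

static BlockNumber
hash_scan_startup(Relation rel, uint32 hashkey)
{
    Buffer      metabuf;
    BlockNumber bucket_blkno;

    /* prevent an active split while we map hash code -> bucket */
    LockPage(rel, 0, ShareLock);

    /* consult the metapage under a short-term share buffer lock */
    metabuf = ReadBuffer(rel, HASH_METAPAGE);
    LockBuffer(metabuf, BUFFER_LOCK_SHARE);
    bucket_blkno = hash_key_to_bucket_blkno(BufferGetPage(metabuf), hashkey);
    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(metabuf);

    /* lock the target bucket before releasing the page-zero lock, so the
     * mapping we just computed cannot be invalidated by a split */
    LockPage(rel, bucket_blkno, ShareLock);
    UnlockPage(rel, 0, ShareLock);

    /* the bucket share-lock is held until scan shutdown */
    return bucket_blkno;
}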

The insertion algorithm is rather similar:
    share-lock page 0 (to prevent active split)
    read/share-lock meta page
    compute bucket number for target hash key
    release meta page
    share-lock bucket page (to prevent split/compact of this bucket)
    release page 0 share-lock
-- (so far same as reader)
    read/exclusive-lock current page of bucket
    if full, release, read/exclusive-lock next page; repeat as needed
    >> see below if no space in any page of bucket
    insert tuple
    write/release current page
    release bucket share-lock
    read/exclusive-lock meta page
    increment tuple count, decide if split needed
    write/release meta page
    done if no split needed, else enter Split algorithm below

It is okay for an insertion to take place in a bucket that is being
actively scanned, because it does not change the position of any existing
item in the bucket, so scan states are not invalidated.  We only need the
short-term buffer locks to ensure that readers do not see a
partially-updated page.

It is clearly impossible for readers and inserters to deadlock, and in
fact this algorithm allows them a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)
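
As a sketch only (the metapage accessors here are hypothetical; the real
fields live in HashMetaPageData and may be named and typed differently),
the tail of the insertion algorithm might look like:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* hypothetical accessors for metapage fields */
extern void   hashmeta_bump_ntuples(Page metapage);
extern double hashmeta_get_ntuples(Page metapage);
extern uint32 hashmeta_get_nbuckets(Page metapage);
extern uint32 hashmeta_get_ffactor(Page metapage);  /* target fill factor */

static bool
hash_insert_finish(Relation rel)
{
    Buffer      metabuf;
    Page        metapage;
    bool        want_split;

    /* short exclusive buffer lock: we are changing the tuple count */
    metabuf = ReadBuffer(rel, HASH_METAPAGE);
    LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
    metapage = BufferGetPage(metabuf);

    hashmeta_bump_ntuples(metapage);
    want_split = hashmeta_get_ntuples(metapage) >
        (double) hashmeta_get_ffactor(metapage) *
        hashmeta_get_nbuckets(metapage);

    /* the real code would also mark the buffer dirty before unlocking */
    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(metabuf);

    return want_split;          /* caller enters the Split algorithm if true */
}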

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
    exclusive-lock page 0 (assert the right to begin a split)
    read/exclusive-lock meta page
    check split still needed
    if split not needed anymore, drop locks and exit
    decide which bucket to split
    Attempt to X-lock new bucket number (shouldn't fail, but...)
    Attempt to X-lock old bucket number (definitely could fail)
    if above fail, drop locks and exit
    update meta page to reflect new number of buckets
    write/release meta page
    release X-lock on page 0
-- now, accesses to all other buckets can proceed.
    Perform actual split of bucket, moving tuples as needed
    >> see below about acquiring needed extra space
    Release X-locks of old and new buckets

Note the page zero and metapage locks are not held while the actual tuple
rearrangement is performed, so accesses to other buckets can proceed in
parallel; in fact, it's possible for multiple bucket splits to proceed
in parallel.

Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it.  We do not want to wait if that happens, first
because we don't want to wait while holding the metapage exclusive-lock,
and second because it could very easily result in deadlock.  (The other
process might be out of the hash AM altogether, and could do something
that blocks on another lock this process holds; so even if the hash
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
So, this is a conditional LockAcquire operation, and if it fails we just
abandon the attempt to split.  This is all right since the index is
overfull but perfectly functional.  Every subsequent inserter will try to
split, and eventually one will succeed.  If multiple inserters failed to
split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop.  (We could make a successful
splitter loop to see if the index is still overfull, but I think I prefer
distributing the split overhead across successive insertions.)
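
Sketched in C, the two non-blocking lock attempts might look like the
following; ConditionalLockPage() is assumed here to be (or to become) a
non-blocking variant of LockPage in lmgr:

#include "postgres.h"
#include "storage/block.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/*
 * Illustrative sketch of the split setup's lock attempts.  If either
 * bucket lock cannot be had immediately, we simply give up: the index
 * stays overfull but fully functional, and a later inserter retries.
 */
static bool
hash_try_lock_split_buckets(Relation rel,
                            BlockNumber old_bucket_blkno,
                            BlockNumber new_bucket_blkno)
{
    /* shouldn't fail -- nobody should know about the new bucket yet */
    if (!ConditionalLockPage(rel, new_bucket_blkno, ExclusiveLock))
        return false;

    /* definitely can fail: a scan may hold share-lock on the old bucket */
    if (!ConditionalLockPage(rel, old_bucket_blkno, ExclusiveLock))
    {
        UnlockPage(rel, new_bucket_blkno, ExclusiveLock);
        return false;
    }

    return true;                /* caller proceeds with the split */
}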

It may be wise to make the initial exclusive-lock-page-zero operation a
conditional one as well, although the odds of a deadlock failure are quite
low.  (AFAICS it could only deadlock against a VACUUM operation that is
trying to X-lock a bucket that the current process has a stopped indexscan
in.)

A problem is that if a split fails partway through (eg due to insufficient
disk space) the index is left corrupt.  The probability of that could be
made quite low if we grab a free page or two before we update the meta
page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action.  I'm not planning to teach hash about WAL in this
go-round.

The fourth operation is garbage collection (bulk deletion):
    next bucket := 0
    read/share-lock meta page
    fetch current max bucket number
    release meta page
    while next bucket <= max bucket do
        Acquire X lock on target bucket
        Scan and remove tuples, compact free space as needed
        Release X lock
        next bucket ++
    end loop
    exclusive-lock meta page
    check if number of buckets changed
    if so, release lock and return to for-each-bucket loop
    else update metapage tuple count
    write/release meta page

Note that this is designed to allow concurrent splits.  If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
but that does no harm.  (We must however be careful about the statistics
reported by the VACUUM operation.  What we can do is count the number of
tuples scanned, and believe this in preference to the stored tuple count
if the stored tuple count and number of buckets did *not* change at any
time during the scan.  This provides a way of correcting the stored tuple
count if it gets out of sync for some reason.  But if a split or insertion
does occur concurrently, the scan count is untrustworthy; instead,
subtract the number of tuples deleted from the stored tuple count and
use that.)
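
A sketch of that accounting decision, with illustrative names only (the
"orig" values are captured at scan start, the "cur" values re-read under
the final exclusive metapage lock; comparing the two snapshots is a
simplification of "did not change at any time"):

#include "postgres.h"

static double
hash_bulkdel_new_count(double orig_ntuples, uint32 orig_maxbucket,
                       double cur_ntuples, uint32 cur_maxbucket,
                       double tuples_remaining, double tuples_removed)
{
    if (cur_ntuples == orig_ntuples && cur_maxbucket == orig_maxbucket)
    {
        /* no concurrent insert or split: the scan's own count of surviving
         * tuples is trustworthy, and also corrects any preexisting drift
         * in the stored count */
        return tuples_remaining;
    }

    /* concurrent activity: split relocation may have made us see tuples
     * twice, so fall back to the stored count minus what we deleted */
    return cur_ntuples - tuples_removed;
}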

The exclusive lock request could deadlock in some strange scenarios, but
we can just error out without any great harm being done.


Free space management
---------------------

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:
    read/exclusive-lock meta page
    determine next bitmap page number; if none, exit loop
    release meta page lock
    read/exclusive-lock bitmap page
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        write/release bitmap page
        read/exclusive-lock meta page
        if first-free-bit value did not change,
            update it and write meta page
        release meta page
        return page number
    else (not found):
        release bitmap page
        loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another bitmap page; update meta information
    write/release meta page
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index.  We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)
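
In isolation, that lock discipline might look like the sketch below;
hash_try_bitmap_page() is a hypothetical stand-in for the bitmap search
itself:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* hypothetical: scan one bitmap page for a free bit, set it, and report
 * the corresponding overflow page number */
extern bool hash_try_bitmap_page(Relation rel, BlockNumber mapblkno,
                                 BlockNumber *freepage);

static bool
hash_overflow_lock_pattern(Relation rel, BlockNumber mapblkno,
                           BlockNumber *freepage)
{
    /* keep the metapage pinned throughout, avoiding repeated pin/unpin */
    Buffer      metabuf = ReadBuffer(rel, HASH_METAPAGE);
    bool        found;

    LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
    /* ... consult metapage: which bitmap page to try next ... */
    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);    /* drop the content lock
                                                 * before touching the
                                                 * bitmap page */

    found = hash_try_bitmap_page(rel, mapblkno, freepage);

    if (found)
    {
        /* retake the metapage lock briefly to update first-free-bit */
        LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
        /* ... if first-free-bit did not change, update it and write ... */
        LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
    }

    ReleaseBuffer(metabuf);     /* finally drop the pin */
    return found;
}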

The portion of tuple insertion that calls the above subroutine looks
like this:
-- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    call free-page-acquire routine
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    update (former) last page to point to new page
    write-lock and initialize new page, with back link to former last page
    write and release former last page
    insert tuple into new page
-- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket.  They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
exclusive lock on the new bucket.

Freeing an overflow page is done by garbage collection and by bucket
splitting (the old bucket may contain no-longer-needed overflow pages).
In both cases, the process holds exclusive lock on the containing bucket,
so need not worry about other accessors of pages in the bucket.  The
algorithm is:
    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    read/share-lock meta page
    determine which bitmap page contains the free space bit for page
    release meta page
    read/exclusive-lock bitmap page
    update bitmap bit
    write/release bitmap page
    if page number is less than what we saw as first-free-bit in meta:
        read/exclusive-lock meta page
        if page number is still less than first-free-bit,
            update first-free-bit field and write meta page
        release meta page

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field.
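
A minimal sketch of that ordering, with hypothetical helpers standing in
for the page-level work described above:

#include "postgres.h"
#include "utils/rel.h"

/* hypothetical helpers; each takes and releases the page locks described
 * in the algorithm above */
extern void   bitmap_clear_bit(Relation rel, uint32 bitno);
extern uint32 hashmeta_read_firstfree(Relation rel);    /* share-lock peek */
extern void   hashmeta_maybe_lower_firstfree(Relation rel, uint32 bitno);

static void
hash_free_overflow_bit(Relation rel, uint32 bitno)
{
    uint32      firstfree = hashmeta_read_firstfree(rel);

    /* step 1: clear the bitmap bit first ... */
    bitmap_clear_bit(rel, bitno);

    /* step 2: ... and only then consider moving first-free-bit backwards;
     * the helper re-checks the current value under the metapage exclusive
     * lock before changing it */
    if (bitno < firstfree)
        hashmeta_maybe_lower_firstfree(rel, bitno);
}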

All the freespace operations should be called while holding no buffer
locks.  Since they need no lmgr locks, deadlock is not possible.


Other notes
-----------

All the shenanigans with locking prevent a split occurring while *another*
process is stopped in a given bucket.  They do not ensure that one of
our *own* backend's scans is not stopped in the bucket, because lmgr
doesn't consider a process's own locks to conflict.  So the Split
algorithm must check for that case separately before deciding it can go
ahead with the split.  VACUUM does not have this problem since nothing
else can be happening within the vacuuming backend.

Should we instead try to fix the state of any conflicting local scan?
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
of other messiness.  For now, just punt and don't split.


Re: Preliminary notes about hash index concurrency (long)

From: Bruce Momjian
Tom Lane wrote:
> I've been looking at fixing the problem reported a few days ago whereby
> a bucket split in a hash index messes up the state of concurrent scans
> of the index, possibly causing some tuples to be missed by the scans.
> AFAICS the only way to fix this is to prevent such a concurrent split.
> Accordingly, I've been trying to redesign the hash index locking
> mechanisms to make that possible, and while I'm at it eliminate the
> various internal deadlock risks that presently exist in hash indexes.
> Attached are some design notes --- any comments?

Seems you are adding locking similar to what we already do in btree.

I know we have two sets of hash codes -- the one used for hash indexes,
and another used for hash joins and now aggregates and subqueries.  I
assume these changes are for hash indexes.  

I know someone reported a problem with the hash indexes (data loss,
serious) --- was that a new 7.4 bug or something that has existed for a
long time?  When were you considering making these changes?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Preliminary notes about hash index concurrency (long)

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> I've been looking at fixing the problem reported a few days ago whereby
>> a bucket split in a hash index messes up the state of concurrent scans
>> of the index, possibly causing some tuples to be missed by the scans.

> Seems you are adding locking similar to what we already do in btree.

Not adding locking --- hash already has locking.  The problem is the
locking is wrong (both inadequate and deadlock prone :-().

> I know we have two sets of hash codes -- the one used for hash indexes,
> and another used for hash joins and now aggregates and subqueries.

There's only one set of hash computation routines now.  But you are
right that the issues under discussion only affect hash indexes, not
in-memory hash tables.

> I know someone reported a problem with the hash indexes (data loss,
> serious) --- was that a new 7.4 bug or something that has existed for a
> long time?

AFAICT the bug's been there since Berkeley days.

> When were you considering making these changes?

Now.  We have three choices: fix it, ship 7.4 with a known data-loss
bug, or remove hash indexes.  Which do you like?
        regards, tom lane


Re: Preliminary notes about hash index concurrency (long)

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> If multiple inserters failed to split, the index might still be overfull,
> but eventually, the index will not be overfull and split attempts will stop.
> (We could make a successful splitter loop to see if the index is still
> overfull, but I think I prefer distributing the split overhead across
> successive insertions.)

If one backend is executing a query but the client has paused reading records,
is it possible the shared lock on the index bucket would be held for a long
time?

If so wouldn't it be possible for an arbitrarily large number of records to be
inserted while the lock is held, eventually causing the bucket to become
extremely large? Is there a maximum size at which the bucket split MUST
succeed or is it purely a performance issue that the buckets be reasonably
balanced?

-- 
greg



Re: Preliminary notes about hash index concurrency (long)

From: Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> I've been looking at fixing the problem reported a few days ago whereby
> >> a bucket split in a hash index messes up the state of concurrent scans
> >> of the index, possibly causing some tuples to be missed by the scans.
> 
> > Seems you are adding locking similar to what we already do in btree.
> 
> Not adding locking --- hash already has locking.  The problem is the
> locking is wrong (both inadequate and deadlock prone :-().

Yea, I meant "adding locking sophistication similar to btree".  :-)

> > I know we have two sets of hash codes -- the one used for hash indexes,
> > and another used for hash joins and now aggregates and subqueries.
> 
> There's only one set of hash computation routines now.  But you are
> right that the issues under discussion only affect hash indexes, not
> in-memory hash tables.

Oh, yes, you merged them, but the index code is somehow separate for
locking issues (per-backend hashes don't have to lock anything).

> > I know someone reported a problem with the hash indexes (data loss,
> > serious) --- was that a new 7.4 bug or something that has existed for a
> > long time?
> 
> AFAICT the bug's been there since Berkeley days.

Oh.

> > When were you considering making these changes?
> 
> Now.  We have three choices: fix it, ship 7.4 with a known data-loss
> bug, or remove hash indexes.  Which do you like?

Since Berkeley, huh?  Seems like a big change.  I would think the
logical solution would be to fix it in 7.5, but that leaves us with
shipping a hash that is now known to be buggy.  While it was always
buggy, we didn't know for sure until now.  We could disable hash indexes
in 7.4, but then re-enable them in 7.5 with the fix.  That seems kind of
strange because the current hash indexes would be moved to btree, but
then they would have to be moved manually back to hash in 7.5.  Is
there a way to convert them to btree, but still have them dump as HASH
in pg_dump, so when they are moved to 7.5, they move back to hashes? 
That might be just too weird.  If all the code changes are only in the
hash indexes, and they are known to be buggy, maybe we should just give
it a shot for 7.4 knowing it probably can't get worse than it already
is (but it could).

I know I am just shooting out ideas, but I am sure one of them will
stick with the group.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Preliminary notes about hash index concurrency (long)

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> If all the code changes are only in the
> hash indexes, and they are known to be buggy, maybe we should just give
> it a shot for 7.4 knowing it probably can't get worse than it already
> is (but it could).

That's basically my opinion.  It's unlikely to get more broken than it
already is.  I've already found several unrelated bugs while studying
the code :-(.  For starters, it doesn't come close to working correctly
when there's more than one overflow-page-bitmap page.  Since we've not
heard reports of problems, I doubt anyone's exercised it with large
numbers of overflow pages.
        regards, tom lane


Re: Preliminary notes about hash index concurrency (long)

From: Hannu Krosing
Tom Lane wrote on Mon, 01.09.2003 at 15:41:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I know someone reported a problem with the hash indexes (data loss,
> > serious) --- was that a new 7.4 bug or something that has existed for a
> > long time?
> 
> AFAICT the bug's been there since Berkeley days.

One could check how BSDDB (http://www.sleepycat.com) handles these
issues. It is reported to have started as btree/hash index code
extracted from an early version of postgres, so perhaps one could at
least get some ideas there, though their locking / concurrency control
is probably quite different.

--------------
Hannu



Re: Preliminary notes about hash index concurrency (long)

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> If multiple inserters failed to split, the index might still be overfull,
>> but eventually, the index will not be overfull and split attempts will stop.

> If one backend is executing a query but the client has paused reading records,
> is it possible the shared lock on the index bucket would be held for a long
> time?

Yes.

> If so wouldn't it be possible for an arbitrarily large number of records to be
> inserted while the lock is held, eventually causing the bucket to become
> extremely large?

Yes.

> Is there a maximum size at which the bucket split MUST succeed or is
> it purely a performance issue that the buckets be reasonably balanced?

AFAICS it's purely a performance issue.

Note also that a hash index will by definition have sucky performance on
large numbers of equal keys, so anyone who is using a hash index on such
a column deserves what they get.  Now you could possibly have this
worst-case scenario even on a column with well-scattered keys, but it
seems improbable.
        regards, tom lane