Re: GSoC on WAL-logging hash indexes - Mailing list pgsql-hackers
From:           Heikki Linnakangas
Subject:        Re: GSoC on WAL-logging hash indexes
Date:
Msg-id:         531873E8.6050802@vmware.com
In response to: Re: GSoC on WAL-logging hash indexes (Greg Stark <stark@mit.edu>)
Responses:      Re: GSoC on WAL-logging hash indexes
List:           pgsql-hackers
(de-CC'ing pgsql-advocacy)

On 03/06/2014 04:03 AM, Greg Stark wrote:
> On Mon, Mar 3, 2014 at 4:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Unfortunately, I don't believe that it's possible to do this easily
>> today because of the way bucket splits are handled. I wrote about
>> this previously here, with an idea for solving the problem:
>
> We could just tackle this in the same incomplete, buggy, way that
> btrees tackled it for years until Heikki fixed them and the way gin
> and gist still do I believe. Namely just emit xlog records for each
> page individually and during replay remember when you have an
> "incomplete split" and complain if recovery ends with any still
> incomplete. That would be unfortunate to be adding new cases of this
> just as Heikki and company are making progress eliminating the ones we
> already had but that's surely better than having no recovery at all.

Grmph. Indeed.

Looking at Robert's idea from November:

> I have thought about doing some work on this, although I don't know
> when I'll ever have the time. As far as I can see, the basic problem
> is that we need a better way of splitting buckets. Right now,
> splitting a bucket means locking it, scanning the entire bucket chain
> and moving ~1/2 the tuples to the new bucket, and then unlocking it.
> Since this is a long operation, we have to use heavyweight locks to
> protect the buckets, which is bad for concurrency. Since it involves
> a modification to an arbitrarily large number of pages that has to be
> atomic, it's not possible to WAL-log it sensibly -- and in fact, I
> think this is really the major obstacle to being able to implementing
> WAL-logging for hash indexes.

I don't think it's necessary to improve concurrency just to get
WAL-logging. Better concurrency is a worthy goal of its own, of course,
but it's a separate concern. For crash safety, the key is to perform the
whole operation as a series of small atomic steps, and to have a way of
knowing where to continue after a crash (this is the same whether you do
the cleanup immediately at the end of recovery, which I want to avoid,
or lazily afterwards). But you can hold locks across those small atomic
steps to ensure concurrency-safety.

> I think we could work around this problem as follows. Suppose we
> have buckets 1..N and the split will create bucket N+1. We need a
> flag that can be set and cleared on the metapage, call it
> split-in-progress, and a flag that can optionally be set on particular
> index tuples, call it moved-by-split. We will allow scans of all
> buckets and insertions into all buckets while the split is in
> progress, but (as now) we will not allow more than one split to be in
> progress at the same time.

Hmm, unless I'm reading the code wrong, it *does* allow more than one
split to be in progress at the same time today.

> [description of how to use the moved-by-split flag to allow scans to
> run concurrently with the split]

I guess that would work, although you didn't actually describe how to
continue a split after a crash. But it's a lot simpler if you don't also
try to make it more concurrent:

---

When you start splitting a bucket, first acquire a heavyweight lock on
the old and new buckets. Allocate the required number of pages before
changing anything on-disk, so that you can easily back out if you run
out of disk space. So far, this is how splitting works today.

Then you update the metapage to show that the bucket has been split, and
initialize the new bucket's primary page (as one atomic WAL-logged
operation).
Also mark the new bucket's primary page with a split-in-progress flag.
That's used later by scans to detect whether the split was interrupted.

Now you scan the old bucket and move all the tuples that belong to the
new bucket. That needs to be done as a series of small atomic WAL-logged
operations, each involving only a small number of old and new pages (one
WAL record for each moved tuple is simplest, but you can do some
batching for efficiency). After you're done, clear the split-in-progress
flag in the new bucket's primary page (WAL-log that), and release the
locks.

In a scan, if the bucket you're about to scan has the split-in-progress
flag set, that indicates that the split was interrupted by a crash. You
won't see the flag as set if a concurrent split is in progress, because
you will block on the lock it's holding on the bucket. If you see the
flag as set, you share-lock and scan both buckets, the old and the new.

If you see the split-in-progress flag in the bucket you're about to
insert into, you don't need to do anything special: just insert the
tuple into the new bucket as normal. Before splitting the new bucket
again, however, the previous split needs to be finished, or things will
get complicated. To do that, acquire the locks on the old and the new
bucket, scan the old bucket for any remaining tuples that belong to the
new bucket and move them, and finally clear the split-in-progress flag.

---

This is similar to your description, in that you scan both buckets if
you see an interrupted split, but it doesn't require the per-tuple
moved-by-split flag or waiting out scans. Also, I put the
split-in-progress flag in the new bucket's primary page instead of the
metapage; that allows multiple splits to be in progress at the same
time. A rough pseudocode sketch of the whole sequence is below.

- Heikki
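To make the sequence concrete, here it is in rough pseudocode. All of
the function and flag names below (begin_split, SPLIT_IN_PROGRESS, and
so on) are made up for illustration -- they are not the existing hash AM
functions -- and buffer locking and WAL record details are glossed over:

/* Hypothetical sketch of the split protocol described above. */

begin_split(old_bucket, new_bucket):
    lock_bucket(old_bucket)            /* heavyweight lock, as today */
    lock_bucket(new_bucket)
    allocate_new_bucket_pages()        /* can still back out cleanly */
    /* one atomic, WAL-logged operation: */
    update_metapage_for_split(old_bucket, new_bucket)
    init_primary_page(new_bucket, SPLIT_IN_PROGRESS)
    wal_log_split_begin(old_bucket, new_bucket)

move_tuples(old_bucket, new_bucket):
    for each tuple in old_bucket's chain:
        if the tuple's hash value maps to new_bucket:
            /* small atomic WAL-logged step; batching is optional */
            move the tuple to new_bucket
            wal_log_tuple_move(tuple)

finish_split(old_bucket, new_bucket):
    clear_flag(new_bucket, SPLIT_IN_PROGRESS)    /* WAL-logged */
    wal_log_split_finish(new_bucket)
    unlock_bucket(new_bucket)
    unlock_bucket(old_bucket)

scan_bucket(bucket):
    share_lock_bucket(bucket)
    if primary_page_has_flag(bucket, SPLIT_IN_PROGRESS):
        /* can't be a concurrent split -- we would have blocked on its
         * lock -- so a previous split was interrupted by a crash */
        share-lock and also scan the bucket this one was split from
    scan bucket as usual

insert_tuple(bucket, tuple):
    /* SPLIT_IN_PROGRESS needs no special handling on insertion */
    insert into bucket as usual

split_bucket(bucket):
    new_bucket = next unused bucket number
    if primary_page_has_flag(bucket, SPLIT_IN_PROGRESS):
        /* 'bucket' was itself created by a split that never finished;
         * finish that one before splitting it again */
        lock both buckets, move the remaining tuples, clear the flag
    begin_split(bucket, new_bucket)
    move_tuples(bucket, new_bucket)
    finish_split(bucket, new_bucket)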