Thread: Concurrent calls of _hash_getnewbuf()

Concurrent calls of _hash_getnewbuf()

From
Antonin Houska
Date:
When doing my experiments with bucket split ([1]), I noticed a comment that
_hash_getnewbuf should not be called concurrently. However, there's no
synchronization of calls from _hash_splitbucket in HEAD. I could reproduce
such concurrent calls using gdb (multiple bucket splits in progress at a
time).

When the function is called from _hash_getovflpage, content lock of metapage
buffer seems to be (mis)used to synchronize the calls:

    /*
     * Fetch the page with _hash_getnewbuf to ensure smgr's idea of the
     * relation length stays in sync with ours.  XXX It's annoying to do this
     * with metapage write lock held; would be better to use a lock that
     * doesn't block incoming searches.
     */
    newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);

I think it'd also be the easiest fix for _hash_splitbucket. Or should a
separate ("regular") lock be introduced and used and used in both cases?


[1] http://www.postgresql.org/message-id/32423.1427413442@localhost

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
new file mode 100644
index 46c6c96..25c1dd1
*** a/src/backend/access/hash/hashpage.c
--- b/src/backend/access/hash/hashpage.c
*************** _hash_splitbucket(Relation rel,
*** 765,771 ****
--- 765,773 ----
      oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

      nblkno = start_nblkno;
+     _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
      nbuf = _hash_getnewbuf(rel, nblkno, MAIN_FORKNUM);
+     _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
      npage = BufferGetPage(nbuf);

      /* initialize the new bucket's primary page */

Re: Concurrent calls of _hash_getnewbuf()

From
Tom Lane
Date:
Antonin Houska <ah@cybertec.at> writes:
> When doing my experiments with bucket split ([1]), I noticed a comment that
> _hash_getnewbuf should not be called concurrently. However, there's no
> synchronization of calls from _hash_splitbucket in HEAD. I could reproduce
> such concurrent calls using gdb (multiple bucket splits in progress at a
> time).

Ugh.

> [ locking the metapage around the _hash_getnewbuf call ]
> I think it'd also be the easiest fix for _hash_splitbucket. Or should a
> separate ("regular") lock be introduced and used and used in both cases?

That doesn't really fix the problem: once we release lock on the metapage,
if a new split starts then it would try to create a bucket at the next
higher page number, so that the _hash_getnewbuf() calls might occur "out
of order", leading to failures of the crosschecks therein.

I think the only simple fix here is to rearrange _hash_splitbucket's API
so that the _hash_getnewbuf() call is done before we drop the metapage
lock at all.  Not sure offhand if the best way is to move that call
into _hash_expandtable(), or to change the API definition so that
_hash_splitbucket is responsible for dropping the metapage lock at the
right time.

Obviously, this whole business of needing the metapage lock to add pages
is nasty.  I wonder if we could use the relation extension lock instead
to serialize those operations, rather than requiring a metapage buffer
lock.  But that would be a much more invasive thing.  In any case, the
current state of affairs is that the relation physical length is tightly
tied to the bucket mapping defined by the metapage contents, and we have
to change both of those together.

Actually ... one of the other things that's not being accounted for here
is the possibility of ENOSPC while adding the new page.  Really, we want
to attempt the extension *before* we modify the bucket mapping.  So that
seems to argue for moving the _hash_getnewbuf call into _hash_expandtable,
where it could be done just before we scribble on the metapage.  OTOH,
we're not any less screwed if we get ENOSPC while trying to add an
overflow page later on :-(, since future readers would assume the bucket
split's been performed.
        regards, tom lane