Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Robert Haas
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id CA+TgmoZRUPMHZfT4+53VCNsUMk-bq3cMomjpa-LP-84uBO6aQg@mail.gmail.com
In response to Re: BufferAlloc: don't take two simultaneous locks  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BufferAlloc: don't take two simultaneous locks  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Apr 14, 2022 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that "just hope it doesn't overflow" is unacceptable.
> But couldn't you bound the number of extra entries as MaxBackends?

Yeah, possibly ... as long as it can't happen that an operation still
counts against the limit after it's failed due to an error or
something like that.
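
(For context, the lookup table is currently sized in StrategyInitialize()
with only a small fixed amount of slop. Bounding the extra entries by
MaxBackends might look roughly like this; just a sketch, where the
MaxBackends term is the hypothetical addition:)

    /*
     * Existing call in freelist.c is
     * InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS).  The extra
     * MaxBackends term is the hypothetical change: allow each backend
     * one transient extra entry while it re-inserts under the new
     * partition lock.
     */
    InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS + MaxBackends);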

> FWIW, I have extremely strong doubts about whether this patch
> is safe at all.  This particular problem seems resolvable though.

Can you be any more specific?

This existing comment is surely in the running for terrible comment of the year:

         * To change the association of a valid buffer, we'll need to have
         * exclusive lock on both the old and new mapping partitions.

Anybody with a little bit of C knowledge will have no difficulty
gleaning from the code which follows that we are in fact acquiring
both partition locks, but whoever wrote this (and I think it was a very
long time ago) did not feel it necessary to explain WHY we will need
to have an exclusive lock on both the old and new mapping partitions,
or more specifically, why we must hold both of those locks
simultaneously. That's unfortunate. It is clear that we need to hold
both locks at some point, just because the hash table is partitioned,
but it is not clear why we need to hold them both simultaneously.
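
For anyone who hasn't looked at it recently, the existing dance is
roughly this (paraphrased from bufmgr.c, not exact):

    /*
     * Existing behavior, roughly: take both mapping partition locks up
     * front, in a consistent order to avoid deadlock, and hold them
     * across the whole delete-old / insert-new operation.
     */
    oldHash = BufTableHashCode(&oldTag);
    oldPartitionLock = BufMappingPartitionLock(oldHash);
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    if (oldPartitionLock < newPartitionLock)
    {
        LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
        LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    }
    else if (oldPartitionLock > newPartitionLock)
    {
        LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
        LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    }
    else
        LWLockAcquire(newPartitionLock, LW_EXCLUSIVE); /* same partition */

    /*
     * ... then BufTableInsert() the new tag, retag the buffer,
     * BufTableDelete() the old tag, and release both locks ...
     */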

It seems to me that whatever hazards exist must come from the fact
that the operation is no longer fully atomic. The existing code
acquires every relevant lock, then does the work, then releases locks.
Ergo, we don't have to worry about concurrency because there basically
can't be any. Stuff could be happening at the same time in other
partitions that are entirely unrelated to what we're doing, but at the
time we touch the two partitions we care about, we're the only one
touching them. Now, if we do as proposed here, we will acquire one
lock, release it, and then take the other lock, and that means that
some operations could overlap that can't overlap today. Whatever gets
broken must get broken because of that possible overlapping, because
in the absence of concurrency, the end state is the same either way.
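
If I'm reading the proposal correctly, the sequence instead becomes
something like this (again just a sketch, not the patch's exact code):

    /*
     * Proposed behavior, as I understand it: the two partitions are no
     * longer locked simultaneously, so there is a window between the
     * delete and the insert during which neither lock is held.
     */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    BufTableDelete(&oldTag, oldHash);
    LWLockRelease(oldPartitionLock);

    /* <-- other backends can act on either partition here */

    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    existing_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    /* existing_id >= 0 means somebody else mapped the page first */
    LWLockRelease(newPartitionLock);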

So ... how could things get broken by having these operations overlap
each other? The possibility that we might run out of buffer mapping
entries is one concern. I guess there's also the question of whether
the collision handling is adequate: if we fail due to a collision and
handle that by putting the buffer on the free list, is that OK? And
what if we fail midway through and the buffer doesn't end up either on
the free list or in the buffer mapping table? I think maybe that's
impossible, but I'm not 100% sure, and I'm not sure how bad it would
be if it did happen. A permanent "leak" of a
buffer that resulted in it becoming permanently unusable would be bad,
for sure. But all of these issues seem avoidable with sufficiently
careful coding. My intuition is that the buffer mapping
table size limit is the nastiest of the problems, and if that's
resolvable then I'm not sure what else could be a hard blocker. I'm
not saying there isn't anything, just that I don't know what it might
be.
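
Concretely, the invariant I'd want the new code to demonstrably
maintain is something like this (sketch):

    /*
     * On every exit path (success, collision, or error) the victim
     * buffer must end up either in the mapping table or on the
     * freelist, never neither.
     */
    existing_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    if (existing_id >= 0)
    {
        /* lost the race: hand our victim buffer back to the freelist */
        StrategyFreeBuffer(buf);
        LWLockRelease(newPartitionLock);
        buf = GetBufferDescriptor(existing_id);
        /* ... pin and return the existing buffer instead ... */
    }
    /*
     * And nothing between the BufTableDelete() and the BufTableInsert()
     * should be able to throw an error, or else the error path has to
     * do equivalent cleanup.
     */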

To put all this another way, suppose that we threw out the way we do
buffer allocation today and always allocated from the freelist. If the
freelist is found to be empty, the backend wanting a buffer has to do
some kind of clock sweep to populate the freelist with >=1 buffers,
and then try again. I don't think that would be performant or fair,
because it would probably happen frequently that a buffer some backend
had just added to the free list got stolen by some other backend, but
I think it would be safe, because we already put buffers on the
freelist when relations or databases are dropped, and we allocate from
there just fine in that case. So then why isn't this safe? It's
functionally the same thing, except we (usually) skip over the
intermediate step of putting the buffer on the freelist and taking it
off again.
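
In pseudocode, that alternative would look roughly like this, where both
helper functions are invented names used only for illustration:

    /*
     * Hypothetical scheme: buffers are only ever handed out from the
     * freelist.  GetBufferFromFreelist() and ClockSweepRefillFreelist()
     * are made-up names, not existing functions.
     */
    for (;;)
    {
        buf = GetBufferFromFreelist();
        if (buf != NULL)
            break;

        /*
         * Evict at least one buffer: pick a victim via the clock sweep,
         * delete its old mapping entry under that partition's lock, and
         * push it onto the freelist.
         */
        ClockSweepRefillFreelist(1);
    }

    /*
     * Now map the buffer under the new partition lock only, exactly as
     * we already do today for buffers freed when a relation or database
     * is dropped.
     */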

--
Robert Haas
EDB: http://www.enterprisedb.com


