Re: Online checksums verification in the backend - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Online checksums verification in the backend
Date
Msg-id 20201102013940.tk5e2co2maedzksw@alap3.anarazel.de
Whole thread Raw
In response to Re: Online checksums verification in the backend  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Online checksums verification in the backend  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Hi,

I'm a bit limited writing - one handed for a while following an injury
on Friday...

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:
> On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote:
> > I think its pretty much *never* OK to do IO while holding a buffer
> > mapping lock. You're locking a significant fraction of shared buffers
> > over IO. That's just not OK.  Don't think there's any place doing so
> > currently either.
> 
> There is no place doing that on HEAD.

Err?

    /* see if the block is in the buffer pool or not */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&buf_tag, buf_hash);
    if (buf_id >= 0)
    {
...
        /* Read the buffer from disk, with the I/O lock still held */
        smgrread(smgr, forknum, blkno, buffer);
        LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
    }
    else
    {
        /*
         * Simply read the buffer.  There's no risk of modification on it as
         * we are holding the buffer pool partition mapping lock.
         */
        smgrread(smgr, forknum, blkno, buffer);
    }

    /* buffer lookup done, so now do its check */
    LWLockRelease(partLock);

How is this not doing IO while holding a buffer mapping lock?


> This specific point was mentioned in the first message of this thread,
> 7th paragraph.  That was a long thread, so it is easy to miss:
> https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

The code clearly doesnt implement it that way.


> I am wondering what you have in mind regarding the use of
> BM_IO_IN_PROGRESS or a similar flag.  Wouldn't that imply some
> consequences for other existing buffers in the table, like a possible
> eviction?

You'd need exactly one empty buffer for that - it can be reused for the
next to-be-checked buffer.


> I'd like to think that we should not do any manipulation of
> the buffer tables in this case.

Why? Its the way we lock buffers - why is this so special that we need
to do differently?


> Hence, in order to prevent a
> concurrent activity to load in shared buffers the page currently
> checked on disk, I got to think that we would need something new here,
> like a filtering hash table that would be checked each time a backend
> tries to insert an entry into the buffer tables.

Thats going to slow down everything a bit - the mapping already is a
bottleneck.


> 1) If the buffer is in shared buffers, we have the APIs to solve that
> by using a pin, unlock the partition, and then do the I/O.  (Still
> that's unsafe with the smgrwrite() argument?)

Thats why you need an appropriate relation lock... Something CheckBuffer
didnt bother to document. Its a restriction, but one we probably can
live with.


> 2) If the buffer is not in shared buffers, we don't have what it takes
> to solve the problem yet.

We do. Set up enough state for the case to be otherwise the same as the
in s_b case.

> But even if we solve this problem, we will
> never really be sure that this is entirely safe, as per the argument
> with concurrent smgrwrite() calls.  Current in-core code assumes that
> this can happen only for init forks of unlogged relations which would
> not be visible yet in the backend doing a page check, still it can be
> really easy to break this assumption with any new code added by a new
> feature.

It also happens in a few other cases than just init forks. But
visibility & relation locking can take care of that. But you need to
document that. If the locking allows concurent readers - and especially
concurrent writers, then you cant really use smgrwite for anything but
relation extension.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Explicit NULL dereference (src/backend/utils/adt/ruleutils.c)
Next
From: Michael Paquier
Date:
Subject: Re: Online checksums verification in the backend