Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Andres Freund
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id 20220225080455.3zzxwfpvbizzcsfo@alap3.anarazel.de
Whole thread Raw
In response to Re: BufferAlloc: don't take two simultaneous locks  (Michail Nikolaev <michail.nikolaev@gmail.com>)
Responses Re: BufferAlloc: don't take two simultaneous locks  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
List pgsql-hackers
Hi,

On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote:
> From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
> From: Yura Sokolov <y.sokolov@postgrespro.ru>
> Date: Mon, 21 Feb 2022 08:49:03 +0300
> Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.
> 
> Acquiring two partition locks leads to complex dependency chain that hurts
> at high concurrency level.
> 
> There is no need to hold both lock simultaneously. Buffer is pinned so
> other processes could not select it for eviction. If tag is cleared and
> buffer removed from old partition other processes will not find it.
> Therefore it is safe to release old partition lock before acquiring
> new partition lock.

Yes, the current design is pretty nonsensical. It leads to really absurd stuff
like holding the relation extension lock while we write out old buffer
contents etc.



> +     * We have pinned buffer and we are single pinner at the moment so there
> +     * is no other pinners.

Seems redundant.


> We hold buffer header lock and exclusive partition
> +     * lock if tag is valid. Given these statements it is safe to clear tag
> +     * since no other process can inspect it to the moment.
> +     */

Could we share code with InvalidateBuffer here? It's not quite the same code,
but nearly the same.


> +     * The usage_count starts out at 1 so that the buffer can survive one
> +     * clock-sweep pass.
> +     *
> +     * We use direct atomic OR instead of Lock+Unlock since no other backend
> +     * could be interested in the buffer. But StrategyGetBuffer,
> +     * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them to
> +     * compare tag, and UnlockBufHdr does raw write to state. So we have to
> +     * spin if we found buffer locked.

So basically the first half of of the paragraph is wrong, because no, we
can't?


> +     * Note that we write tag unlocked. It is also safe since there is always
> +     * check for BM_VALID when tag is compared.



>       */
>      buf->tag = newTag;
> -    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
> -                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
> -                   BUF_USAGECOUNT_MASK);
>      if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
> -        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
> +        new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
>      else
> -        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
> -
> -    UnlockBufHdr(buf, buf_state);
> +        new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
>  
> -    if (oldPartitionLock != NULL)
> +    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
> +    while (unlikely(buf_state & BM_LOCKED))

I don't think it's safe to atomic in arbitrary bits. If somebody else has
locked the buffer header in this moment, it'll lead to completely bogus
results, because unlocking overwrites concurrently written contents (which
there shouldn't be any, but here there are)...

And or'ing contents in also doesn't make sense because we it doesn't work to
actually unset any contents?

Why don't you just use LockBufHdr/UnlockBufHdr?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Add checkpoint and redo LSN to LogCheckpointEnd log message