Re: BufferAlloc: don't take two simultaneous locks - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: BufferAlloc: don't take two simultaneous locks
Date
Msg-id 20220408.164645.1722424483743492345.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses Re: BufferAlloc: don't take two simultaneous locks  (Yura Sokolov <y.sokolov@postgrespro.ru>)
List pgsql-hackers
At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrote in
> В Чт, 07/04/2022 в 16:55 +0900, Kyotaro Horiguchi пишет:
> > Hi, Yura.
> >
> > At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.sokolov@postgrespro.ru> wrot
> > e in
> > > Ok, I got access to stronger server, did the benchmark, found weird
> > > things, and so here is new version :-)
> >
> > Thanks for the new version and benchmarking.
> >
> > > First I found if table size is strictly limited to NBuffers and FIXED,
> > > then under high concurrency get_hash_entry may not find free entry
> > > despite it must be there. It seems while process scans free lists, other
> > > concurrent processes "moves etry around", ie one concurrent process
> > > fetched it from one free list, other process put new entry in other
> > > freelist, and unfortunate process missed it since it tests freelists
> > > only once.
> >
> > StrategyGetBuffer believes that entries don't move across freelists
> > and it was true before this patch.
>
> StrategyGetBuffer knows nothing about dynahash's freelist.
> It knows about buffer manager's freelist, which is not partitioned.

Yeah, right. I meant get_hash_entry.

> > > To fix this issues I made following:
> > >
> > > # Concurrency
> > >
> > > First, I limit concurrency by introducing other lwlocks tranche -
> > > BufferEvict. It is 8 times larger than BufferMapping tranche (1024 vs
> > > 128).
> > > If backend doesn't find buffer in buffer table and wants to introduce
> > > it, it first calls
> > >     LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > > If lock were acquired, then it goes to eviction and replace process.
> > > Otherwise, it waits lock to be released and repeats search.
> > >
> > > This greately improve performance for > 400 clients in pgbench.
> >
> > So the performance difference between the existing code and v11 is the
> > latter has a collision cross section eight times smaller than the
> > former?
>
> No. Acquiring EvictPartitionLock
> 1. doesn't block readers, since readers doesn't acquire EvictPartitionLock
> 2. doesn't form "tree of lock dependency" since EvictPartitionLock is
>   independent from PartitionLock.
>
> Problem with existing code:
> 1. Process A locks P1 and P2
> 2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
> 3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
> 4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
> At this moment locks P1, P2, P3, P4 and P5 are all locked and waiting
> for Process A.
> And readers can't read from same five partitions.
>
> With new code:
> 1. Process A locks E1 (evict partition) and locks P2,
>    then releases P2 and locks P1.
> 2. Process B tries to locks E1, waits and retries search.
> 3. Process C locks E4, locks P1, then releases P1 and locks P4
> 4. Process D locks E5, locks P4, then releases P4 and locks P5
> So, there is no network of locks.
> Process A doesn't block Process D in any moment:
> - either A blocks C, but C doesn't block D at this moment
> - or A doesn't block C.
> And readers doesn't see simultaneously locked five locks which
> depends on single Process A.

Thansk for the detailed explanation. I see that.

> > +        * Prevent "thundering herd" problem and limit concurrency.
> >
> > this is something like pressing accelerator and break pedals at the
> > same time.  If it improves performance, just increasing the number of
> > buffer partition seems to work?
>
> To be honestly: of cause simple increase of NUM_BUFFER_PARTITIONS
> does improve average case.
> But it is better to cure problem than anesthetize.
> Increase of
> NUM_BUFFER_PARTITIONS reduces probability and relative
> weight of lock network, but doesn't eliminate.

Agreed.

> > It's also not great that follower backends runs a busy loop on the
> > lock until the top-runner backend inserts the new buffer to the
> > buftable then releases the newParititionLock.
> >
> > > I tried other variant as well:
> > > - first insert entry with dummy buffer index into buffer table.
> > > - if such entry were already here, then wait it to be filled.
> > > - otherwise find victim buffer and replace dummy index with new one.
> > > Wait were done with shared lock on EvictPartitionLock as well.
> > > This variant performed quite same.
> >
> > This one looks better to me. Since a partition can be shared by two or
> > more new-buffers, condition variable seems to work better here...
> >
> > > Logically I like that variant more, but there is one gotcha:
> > > FlushBuffer could fail with elog(ERROR). Therefore then there is
> > > a need to reliable remove entry with dummy index.
> >
> > Perhaps UnlockBuffers can do that.
>
> Thanks for suggestion. I'll try to investigate and retry this way
> of patch.
>
> > > And after all, I still need to hold EvictPartitionLock to notice
> > > waiters.
> > > I've tried to use ConditionalVariable, but its performance were much
> > > worse.
> >
> > How many CVs did you use?
>
> I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
> It doesn't matter.
> Looks like use of WaitLatch (which uses epoll) and/or tripple
> SpinLockAcquire per good case (with two list traversing) is much worse
> than PgSemaphorLock (which uses futex) and single wait list action.

Sure.  I unintentionally neglected the overhead of our CV
implementation. It cannot be used in such a hot path.

> Other probability is while ConditionVariable eliminates thundering
> nerd effect, it doesn't limit concurrency enough... but that's just
> theory.
>
> In reality, I'd like to try to make BufferLookupEnt->id to be atomic
> and add LwLock to BufferLookupEnt. I'll test it, but doubt it could
> be merged, since there is no way to initialize dynahash's entries
> reliably.

Yeah, that's what came to my mind first (but with not a LWLock but a
CV) but gave up for the reason of additional size.  The size of
BufferLookupEnt is 24 and sizeof(ConditionVariable) is 12.  By the way
sizeof(LWLock) is 16..  So I think we don't take the per-bufentry
approach here for the reason of additional memory usage.

> > I don't think that causes significant performance hit, but I don't
> > understand how it improves freelist hit ratio other than by accident.
> > Could you have some reasoning for it?
>
> Since free_reused_entry returns entry into random free_list, this
> probability is quite high. In tests, I see stabilisa

Maybe.  Doesn't it improve the efficiency if we prioritize emptied
freelist on returning an element?  I tried it with an atomic_u32 to
remember empty freelist. On the uin32, each bit represents a freelist
index.  I saw it eliminated calls to element_alloc.  I tried to
remember a single freelist index in an atomic but there was a case
where two freelists are emptied at once and that lead to element_alloc
call.

> > By the way the change of get_hash_entry looks something wrong.
> >
> > If I understand it correctly, it visits num_freelists/4 freelists at
> > once, then tries element_alloc. If element_alloc() fails (that must
> > happen), it only tries freeList[freelist_idx] and gives up, even
> > though there must be an element in other 3/4 freelists.
>
> No. If element_alloc fails, it tries all NUM_FREELISTS again.
> - condition: `ntries || !allocFailed`. `!allocFailed` become true,
>   so `ntries` remains.
> - `ntries = num_freelists;` regardless of `allocFailed`.
> Therefore, all `NUM_FREELISTS` are retried for partitioned table.

Ah, okay. ntries is set to num_freelists after calling element_alloc.
I think we (I?) need more comments.

By the way, why it is num_freelists / 4 + 1?

> > > `free_reused_entry` now returns entry to random position. It flattens
> > > free entry's spread. Although it is not enough without other changes
> > > (thundering herd mitigation and probing more lists in get_hash_entry).
> >
> > If "thudering herd" means "many backends rush trying to read-in the
> > same page at once", isn't it avoided by the change in BufferAlloc?
>
> "thundering herd" reduces speed of entries migration a lot. But
> `simple_select` benchmark is too biased: looks like btree root is
> evicted from time to time. So entries are slowly migrated to of from
> freelist of its partition.
> Without "thundering herd" fix this migration is very fast.

Ah, that observation agree with the seemingly unidirectional migration
of free entries.

I remember that it is raised in this list several times to prioritize
index pages in shared buffers..

> > Up to about 15%(?) of gain is great.
>
> Up to 35% in "2 socket 3 key 1GB" case.
> Up to 44% in "2 socket 1 key 128MB" case.

Oh, more great!

> > I'm not sure it is okay that it seems to slow by about 1%..
>
> Well, in fact some degradation is not reproducible.
> Surprisingly, results change a bit from time to time.

Yeah.

> I just didn't rerun whole `master` branch bench again
> after v11 bench, since each whole test run costs me 1.5 hour.

Thans for the labor.

> But I confirm regression on "1 socket 1 key 1GB" test case
> between 83 and 211 connections. It were reproducible on
> more powerful Xeon 8354H, although it were less visible.
>
> Other fluctuations close to 1% are not reliable.

I'm glad to hear that.  It is not surprising that some fluctuation
happens.

> For example, sometimes I see degradation or improvement with
> 2GB shared buffers (and even more than 1%). But 2GB is enough
> for whole test dataset (scale 100 pgbench is 1.5GB on disk).
> Therefore modified code is not involved in benchmarking at all.
> How it could be explained?
> That is why I don't post 2GB benchmark results. (yeah, I'm
> cheating a bit).

If buffer replacement doesn't happen, theoretically this patch cannot
be involved in the fluctuation.  I think we can consider it an error.

It might come from placement of other variables. I have somethimes got
annoyed by such small but steady change of performance that persists
until I recompiled the whole tree.  But, sorry, I don't have a clear
idea of how such performance shift happens..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: vignesh C
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Dave Page
Date:
Subject: Re: Windows now has fdatasync()