lwlock optimization opportunities - Mailing list pgsql-hackers
From:     Andres Freund
Subject:  lwlock optimization opportunities
Date:
Msg-id:   20201230024244.aekvtxzztlng2qq7@alap3.anarazel.de
List:     pgsql-hackers
Hi,

This started as a reply to
https://postgr.es/m/CAH2-WzkbCvgKrmw%2Bf%2B1hwgXhmiv%2BUNRihotALXftUiNr%3D3VUKA%40mail.gmail.com
but after typing for a while I decided that it's large and unrelated
enough to be better handled as a new thread.

On 2020-11-27 11:08:49 -0800, Peter Geoghegan wrote:
> We've made LWLocks much more scalable in the last 10 years. Why
> shouldn't we expect to do the same again in the next 10 years? I
> wouldn't bet against it. I might even do the opposite (predict further
> improvements to LWLocks).

I've done a bunch of benchmarking that shows clear areas of improvement
(illustrative sketches for several of the ideas follow after the list):

- For LWLockWaitForVar() / LWLockUpdateVar() we take the wait list lock
  largely unnecessarily in a lot of cases. On platforms that have atomic
  8 byte reads we don't need the lock to read/write *valptr. Nor do we
  need to look at the list of waiters in LWLockUpdateVar() if
  LW_FLAG_HAS_WAITERS is not set. WAL insertion is a fairly hot path, so
  it'd be nice to improve this.

- For locks that are frequently share-locked (e.g. procarray lock,
  buffer locks), making the share-lock acquisition an atomic xadd
  instead of a compare-exchange loop is a boon. It does make the code a
  bit more complicated (to handle races where the optimistically added
  shared locker happens after an exclusive acquisition), which led me to
  remove that optimization before committing the current lwlock lock
  protocol. But I think it's time to add it, because we end up with a
  lot of entirely unnecessary retries in read-only/mostly workloads.

- Currently waking up waiters is O(#waiters) syscalls, because we wake
  up each waiting backend via a separate semaphore. On Linux that boils
  down to a futex(semaphore, WAKE) syscall. The work the kernel needs to
  do for each of the futexes is substantial - in workloads with a lot of
  blocking I have seen 30% of the CPU time spent in related code. I
  played around with using futex() directly inside lwlock.c - that
  yields quite a bit of benefit, because a) we can wake up many waiters
  at once, removing syscalls / futex lookups, and b) there are far fewer
  different futexes & cachelines touched. Even things like only waking
  up shared-lockers etc. can be done in one syscall, via
  FUTEX_WAIT_BITSET / FUTEX_WAKE_BITSET (with bits indicating the
  different wait modes).

- LWLockAcquireOrWait(), which we use for WALWriteLock, can waste a lot
  of CPU time if (some of) the waiters actually need to wait for WAL to
  be written out. We wake up *all* the waiters, none of them takes the
  lock, then all of them try to acquire the lock again - that's a lot of
  contention on a single poor cacheline, and a lot of pressure on the OS
  scheduler. I think what we instead need is a combo of
  LWLockAcquire/Release that allows a callback to inspect what needs to
  be done with which waiters. In the WAL case such a callback could
  utilize a 'flush request position' publicized for each PGPROC to
  decide whether to wake the process because the lock request is already
  fulfilled, and wake exactly one of the remaining processes to actually
  acquire the lock (rather than exiting LWLockAcquireOrWait(), as is
  currently the case).

- There are a large number of lwlocks - some of them bottlenecks - that
  are only ever taken in exclusive mode. For those locks we can use more
  efficient locking protocols:

  - Lock release does not need to be an atomic operation (there's no
    count of readers that can change concurrently). That can be a
    significant performance benefit.

  - Lock acquisition can be an atomic-exchange, instead of a
    compare-exchange.

  Unfortunately that probably requires designing the lock representation
  a bit differently. I don't know if it's feasible to share the data
  structure and just use different lock/unlock functions.
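To make the first item concrete, here's a rough, standalone C11 sketch of
skipping the wait list lock in the update path. The struct, field names and
the LW_FLAG_HAS_WAITERS bit position are made up for illustration - this is
not the actual lwlock.c representation or the pg_atomic_* API:

/*
 * Illustrative only: publish the protected 64-bit value without taking
 * the wait list lock unless there are waiters.  Names, layout and flag
 * value are assumptions, not PostgreSQL's actual structures.
 */
#include <stdatomic.h>
#include <stdint.h>

#define LW_FLAG_HAS_WAITERS (1u << 30)   /* assumed bit */

typedef struct SketchVarLock
{
    _Atomic uint32_t state;     /* lock state, including waiter flag */
    _Atomic uint64_t var;       /* the protected value (*valptr) */
} SketchVarLock;

static void
sketch_update_var(SketchVarLock *lock, uint64_t newval)
{
    /*
     * With atomic 8-byte stores the value can be published without the
     * wait list lock; the release store orders it before any wakeup.
     */
    atomic_store_explicit(&lock->var, newval, memory_order_release);

    /*
     * Only when somebody is actually waiting do we need to walk the
     * wait list, which is what requires the wait list lock (elided).
     */
    if (atomic_load_explicit(&lock->state, memory_order_acquire) &
        LW_FLAG_HAS_WAITERS)
    {
        /* take wait list lock, wake up waiters blocked on the var */
    }
}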
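For the xadd idea, a similarly hypothetical sketch of the optimistic shared
acquisition and the back-out needed when an exclusive holder got there first
(again C11 atomics and an assumed state layout, not the real one):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LW_VAL_EXCLUSIVE (1u << 24)     /* assumed state layout */

typedef struct SketchLock
{
    _Atomic uint32_t state;     /* shared count in low bits, excl bit above */
} SketchLock;

/*
 * Shared acquisition as a single atomic add instead of a CAS loop.  If an
 * exclusive holder is (or became) present, the optimistically added
 * reference has to be backed out before going to the wait path.
 */
static bool
sketch_try_lock_shared(SketchLock *lock)
{
    uint32_t    oldstate;

    oldstate = atomic_fetch_add_explicit(&lock->state, 1,
                                         memory_order_acquire);
    if (oldstate & LW_VAL_EXCLUSIVE)
    {
        /* raced with an exclusive acquisition: undo our reference */
        atomic_fetch_sub_explicit(&lock->state, 1, memory_order_release);
        return false;           /* caller falls back to waiting */
    }
    return true;
}

The win is that under pure read contention the add always succeeds, whereas
a compare-exchange loop has to retry whenever another reader changes the
count.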
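For the futex item, the bitset variants look roughly like this on Linux
(futex() has no glibc wrapper, so it's a raw syscall; which bit encodes
which wait mode is an arbitrary choice here, and the retry loop around the
wait is elided):

#define _GNU_SOURCE
#include <limits.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* assumed encoding of wait modes as futex bitset bits */
#define WAIT_BIT_SHARED     (1u << 0)
#define WAIT_BIT_EXCLUSIVE  (1u << 1)

static long
futex(uint32_t *uaddr, int op, uint32_t val,
      const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
}

/*
 * Block until the futex word changes, tagged with the waiter's mode.  A
 * real implementation needs to re-check the lock state and retry on
 * spurious wakeups / EINTR.  The non-PRIVATE ops are used because the
 * word would live in memory shared between multiple processes.
 */
static void
sketch_wait_shared(uint32_t *futex_word, uint32_t expected)
{
    futex(futex_word, FUTEX_WAIT_BITSET, expected,
          NULL, NULL, WAIT_BIT_SHARED);
}

/*
 * Wake all blocked shared lockers with one syscall touching one futex
 * word / cacheline, instead of one semaphore wakeup per waiter.
 */
static void
sketch_wake_all_shared(uint32_t *futex_word)
{
    futex(futex_word, FUTEX_WAKE_BITSET, INT_MAX,
          NULL, NULL, WAIT_BIT_SHARED);
}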
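The callback idea for LWLockAcquireOrWait() could be shaped something like
the following - entirely hypothetical names and types, just to illustrate
how a per-PGPROC flush request position would feed the decision:

#include <stdint.h>

typedef uint64_t SketchLSN;

/* Hypothetical per-waiter state; in PostgreSQL this would live in PGPROC. */
typedef struct SketchWaiter
{
    SketchLSN   flush_request;  /* WAL position this backend wants flushed */
} SketchWaiter;

typedef enum SketchWaiterAction
{
    WAITER_WAKE_SATISFIED,      /* request already fulfilled, just wake it */
    WAITER_KEEP_WAITING         /* leave it on the wait list */
} SketchWaiterAction;

/* Callback the lock release primitive would invoke for each waiter. */
typedef SketchWaiterAction (*SketchReleaseCallback) (SketchWaiter *w, void *arg);

/*
 * WALWriteLock-style callback: wake everyone whose requested flush
 * position is already covered.  The primitive itself (not shown) would
 * additionally pick exactly one still-waiting process and wake it so
 * that it acquires the lock and does the next write/flush.
 */
static SketchWaiterAction
wal_write_release_cb(SketchWaiter *w, void *arg)
{
    SketchLSN   flushed_up_to = *(SketchLSN *) arg;

    if (w->flush_request <= flushed_up_to)
        return WAITER_WAKE_SATISFIED;
    return WAITER_KEEP_WAITING;
}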
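And for exclusive-only locks, a sketch of why both sides get cheaper:
acquisition becomes an unconditional exchange, release a plain store
(blocking/wakeup paths elided, layout made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Exclusive-only lock: a single word, no shared-locker count to maintain. */
typedef struct SketchExclusiveLock
{
    _Atomic uint32_t locked;    /* 0 = free, 1 = held */
} SketchExclusiveLock;

static bool
sketch_excl_try_acquire(SketchExclusiveLock *lock)
{
    /* unconditional exchange instead of a compare-exchange loop */
    return atomic_exchange_explicit(&lock->locked, 1,
                                    memory_order_acquire) == 0;
}

static void
sketch_excl_release(SketchExclusiveLock *lock)
{
    /*
     * No reader count can change underneath us, so the release doesn't
     * need an atomic read-modify-write; a release store is enough
     * (waiter wakeup, if any, elided).
     */
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}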
There's more, but that's already a long list ;)

Regards,

Andres