Buffer locking is special (hints, checksums, AIO writes) - Mailing list pgsql-hackers
From: Andres Freund
Subject: Buffer locking is special (hints, checksums, AIO writes)
Date:
Msg-id: fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Responses: Re: Buffer locking is special (hints, checksums, AIO writes)
List: pgsql-hackers
Hi,

I'm working on making bufmgr.c ready for AIO writes. That requires some infrastructure changes that I wanted to discuss. In this email I'm trying to outline the problems and what I currently think we should do.


== Problem 1 - hint bits ==

Currently pages that are being written out are copied if checksums are enabled. The copy is needed because hint bits can be set while holding just a share lock, and writing out a buffer also only requires a share lock. If we didn't copy the buffer, the checksum computed at the start of the write could be falsified by hint bits being set while the write is in progress.

Even without checksums, hint bits being set while IO is ongoing is an issue: e.g. btrfs assumes that pages do not change while being written out with direct-io, and corrupts itself if they do [1].

The copies we make to avoid checksum failures are not at all cheap [2], and they are particularly problematic for AIO, where a lot of buffers can undergo IO at the same time. For AIO the issue is that the longer-lived bounce buffers I had initially prototyped end up using a lot of memory, making it hard to figure out what defaults to set.

In [2] I had worked on an approach that avoids copying pages during writes by not allowing hint bits to be set while IO is ongoing: hint bit writes are simply skipped during that window (plus a fair bit of infrastructure to make that cheap enough).

In that thread Heikki wondered whether we shouldn't instead go for a more fundamental solution to the problem, like introducing a separate lock level that prevents concurrent modifications but still allows share lockers. At the time I was against that, because it seemed like a large project. However, since then I realized there's an architecturally related second issue:


== Problem 2 - AIO writes vs exclusive locks ==

Separate from the hint bit issue, there is a second problem that I didn't have a good answer for: making exclusive lock acquisition concurrency safe in the presence of asynchronous writes.

While a buffer is being written out, it obviously has to be share locked. That's true even with AIO; the difference is that with AIO the share lock is held until the IO has completed. The problem is that if a backend wants to exclusively lock a buffer undergoing AIO, it can't just wait for the content lock as it does today - it might have to actually reap the IO completion from the operating system. If one just were to wait for the content lock, there is no forward progress guarantee.

The buffer's state "knows" that it's undergoing write IO (BM_VALID and BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive locker needs to wait for the IO (pgaio_wref_wait() on the BufferDesc's io_wref). The problem is that it's surprisingly hard to do so race free: if backend A were to just check whether a buffer is undergoing IO before locking it, backend B could start IO on the buffer between A checking for BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't hold the buffer header spinlock across a blocking lwlock acquisition. There potentially are ways to synchronize the buffer state and the content lock, but that requires deep integration between bufmgr.c and lwlock.c.
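To make the race concrete, here is a minimal sketch of the naive approach. NaiveLockBufferExclusive() is a hypothetical function written for illustration, not actual bufmgr.c code; the other identifiers are the existing bufmgr/AIO ones:

    #include "postgres.h"
    #include "storage/aio.h"
    #include "storage/buf_internals.h"

    /* Sketch only: the naive, racy way to exclusively lock a buffer. */
    static void
    NaiveLockBufferExclusive(BufferDesc *buf)
    {
        uint32          buf_state = LockBufHdr(buf);
        bool            io_in_progress = (buf_state & BM_IO_IN_PROGRESS) != 0;
        PgAioWaitRef    iow = buf->io_wref;

        UnlockBufHdr(buf, buf_state);

        /* reap a write that was already in flight when we looked */
        if (io_in_progress)
            pgaio_wref_wait(&iow);

        /*
         * RACE: between UnlockBufHdr() above and LWLockAcquire() below,
         * another backend can start a new write on this buffer.  That write
         * holds the content lock in share mode until its completion is
         * reaped, so just sleeping on the lwlock here provides no forward
         * progress guarantee.  Holding the buffer header spinlock across the
         * blocking lock acquisition is not an option either.
         */
        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
    }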
== Problem 3 - Cacheline contention ==

This is unrelated to AIO, but might influence the architecture of potential solutions. It might make sense to skip over this section on a quicker read-through.

The problem is that in some workloads the cacheline containing the BufferDesc becomes very hot. The most common case is workloads with lots of index nested loops, where the root page and some of the other inner pages of the index become very contended. I've seen cases where running the same workload in four separate copies within the same postgres instance yields ~8x the throughput - on a machine with forty cores, and there are machines with many more cores these days [4].

There are several issues:

a) We manipulate the same cache line multiple times. E.g. btree always first pins the page and then locks it, which performs two atomic operations on the same cacheline. I've experimented with putting the content lock on a separate cacheline, but that causes regressions in other cases. Leaving nontrivial implementation issues aside, we could release the lock and the pin at the same time. With somewhat more difficulty we could do the same for pin and lock acquisition. nbtree almost always does the two together, so it wouldn't be hard to make it benefit from such an optimization.

b) Many operations, like unpinning a buffer, need to use compare-and-swap (CAS) instead of an atomic subtraction. Atomic add/sub scales a *lot* better than CAS, as there is no need to retry; with increased contention the number of CAS retries grows further and further. The reason we need CAS in so many places is the following: historically postgres' spinlocks don't use an atomic operation to release the spinlock on x86. When making buffer header locks their own thing, I carried that forward to avoid performance regressions. Because of that, BufferDesc.state may not be modified by other backends while the buffer header spinlock is held - which is incompatible with unpinning via atomic-sub. However, since then we converted most of the "hot" uses of buffer header spinlocks into CAS loops (see e.g. PinBuffer()). That makes it feasible to use an atomic operation for the buffer header lock release, which in turn allows e.g. unpinning with an atomic sub (see the sketch after this list).

c) Read accesses to the BufferDesc cause contention. Some code, like nbtree, relies on functions like BufferGetBlockNumber(). Unfortunately that contends with concurrent modifications of the buffer descriptor (like pinning). Potential solutions are to rely less on functions like BufferGetBlockNumber(), or to split that information out into a separate (denser?) array.

d) Even after addressing all of the above, there's still a lot of contention. I think the solution here would be something roughly akin to fastpath locks: if a buffer is very contended, we can mark it as super-pinned & share-locked, avoiding any atomic operation on the buffer descriptor itself. Instead the current lock and pin count would be stored in each backend's PGPROC. Obviously evicting or exclusively locking such a buffer would be a lot more expensive. I've prototyped it and it helps a *lot*. The reason I mention this here is that it seems impossible to do while using the generic lwlocks for the content lock.
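A minimal sketch of the contrast in (b). The Unpin* functions are hypothetical, written only to illustrate the point; BUF_REFCOUNT_ONE, BM_LOCKED and the pg_atomic_* operations are the existing ones:

    #include "postgres.h"
    #include "storage/buf_internals.h"

    /*
     * Today-style unpin: a CAS loop.  The state must not be changed while
     * another backend holds the buffer header spinlock (BM_LOCKED), because
     * that lock is currently released with a plain write that would overwrite
     * our modification.  Under contention the CAS fails and must be retried,
     * which scales poorly.
     */
    static void
    UnpinWithCAS(pg_atomic_uint32 *state)
    {
        uint32      old_state = pg_atomic_read_u32(state);

        for (;;)
        {
            if (old_state & BM_LOCKED)
            {
                /* real code would wait for the header lock to be released */
                old_state = pg_atomic_read_u32(state);
                continue;
            }

            if (pg_atomic_compare_exchange_u32(state, &old_state,
                                               old_state - BUF_REFCOUNT_ONE))
                break;
            /* on failure old_state now holds the current value; retry */
        }
    }

    /*
     * Possible future unpin: once releasing the buffer header spinlock is
     * itself an atomic operation, concurrent modifications of the state are
     * safe even while the lock is held, and unpinning becomes a single
     * fetch-sub that never needs to retry.
     */
    static void
    UnpinWithSub(pg_atomic_uint32 *state)
    {
        pg_atomic_fetch_sub_u32(state, BUF_REFCOUNT_ONE);
    }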
== Solution ? ==

My conclusion from the above is that we ought to:

A) Make buffer locks something separate from lwlocks

As part of that, introduce a new lock level in-between share and exclusive (provisionally called share-exclusive, but I hate the name). The new lock level allows other share lockers, but can only be held by one backend at a time.

This allows us to change the rules so that:
1) Share lockers are no longer allowed to modify buffers
2) Setting hint bits requires the new lock mode (conditionally upgrading the lock in SetHintBits())
3) Write IO requires the new lock level

This addresses Problem 1.

B) Merge BufferDesc.state and the content lock

This addresses Problem 2, as we can now atomically check whether IO was concurrently initiated. BufferDesc.state obviously isn't wide enough for that today, so the buffer state has to be widened to a 64-bit variable. (A purely illustrative sketch of such a combined state word is at the end of this mail.)

C) Allow some modifications of BufferDesc.state while the buffer header spinlock is held

Just naively doing the above two things reduces scalability, as the likelihood of CAS failures increases due to the larger number of modifications of the same atomic variable. However, by allowing unpinning while the buffer header spinlock is held, scalability improves considerably in my tests. Doing so requires changing all uses of LockBufHdr(), but by introducing a static inline helper the complexity can largely be encapsulated.

I've prototyped the above. The current state is pretty rough, but before I spend the non-trivial time to turn it into an understandable sequence of changes, I wanted to get feedback: does this plan sound reasonable?

The hardest part about this change is that everything kind of depends on everything else. The changes are large enough that they clearly can't just be committed at once, but doing them over time risks [temporary] performance regressions. The order of changes that I think makes the most sense is:

1) Allow some modifications of BufferDesc.state while the buffer header spinlock is held
2) Unpin buffers with just an atomic-sub
   These need to happen first, otherwise there are performance regressions during the later steps.
3) Widen BufferDesc.state to 64 bits
4) Implement buffer locking inside BufferDesc.state
5) Do IO while holding the share-exclusive lock and require all buffer modifications to hold at least a share-exclusive lock
6) Wait for AIO when acquiring an exclusive content lock

(some of these will likely have sub-parts of their own, but those are details)

Sane? DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
[3] In some cases the copies cause slowdowns of close to 50% for the checkpointer!
[4] https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf - slide 19
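As referenced under (B): a purely illustrative sketch of a combined, widened buffer state word. All names and bit assignments below are invented for this sketch, not taken from the prototype:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative layout of a 64-bit state that also embeds the lock. */
    #define XBUF_REFCOUNT_MASK      UINT64_C(0x3FFFF)       /* pin count */
    #define XBUF_HDR_LOCKED         (UINT64_C(1) << 20)     /* header "spinlock" */
    #define XBUF_VALID              (UINT64_C(1) << 21)
    #define XBUF_DIRTY              (UINT64_C(1) << 22)
    #define XBUF_IO_IN_PROGRESS     (UINT64_C(1) << 23)
    #define XBUF_SHARE_ONE          (UINT64_C(1) << 32)     /* share-locker count */
    #define XBUF_SHARE_MASK         (UINT64_C(0xFFFF) << 32)
    #define XBUF_SHARE_EXCLUSIVE    (UINT64_C(1) << 48)     /* the new lock level */
    #define XBUF_EXCLUSIVE          (UINT64_C(1) << 49)

    /*
     * Because the lock state and the IO-in-progress flag live in the same
     * atomic word, an exclusive locker can decide in a single atomic
     * read/CAS whether it first has to wait for a pending asynchronous
     * write - closing the race from Problem 2.  This helper only inspects a
     * previously read state value; it does not acquire anything.
     */
    static inline bool
    xbuf_can_lock_exclusive(uint64_t state, bool *must_wait_for_io)
    {
        *must_wait_for_io = (state & XBUF_IO_IN_PROGRESS) != 0;

        return (state & (XBUF_SHARE_MASK |
                         XBUF_SHARE_EXCLUSIVE |
                         XBUF_EXCLUSIVE)) == 0;
    }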