Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [PoC] Non-volatile WAL buffer
Msg-id: 93100c91-66e9-faa6-704c-ac47634e1203@enterprisedb.com
In response to: Re: [PoC] Non-volatile WAL buffer (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: RE: [PoC] Non-volatile WAL buffer, Re: [PoC] Non-volatile WAL buffer
List: pgsql-hackers
On 1/25/21 3:56 AM, Masahiko Sawada wrote:
>>
>> ...
>>
>> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>> ...
>>>
>>> While looking at the two methods: NTT and simple-no-buffer, I realized
>>> that in XLogFlush(), the NTT patch flushes (by pmem_flush() and
>>> pmem_drain()) WAL without acquiring WALWriteLock, whereas the
>>> simple-no-buffer patch acquires WALWriteLock to do that
>>> (pmem_persist()). I wonder if this also affected the performance
>>> differences between those two methods, since WALWriteLock serializes
>>> the operations. With PMEM, can multiple backends concurrently flush
>>> the records if the memory regions don't overlap? If so, flushing
>>> WAL without WALWriteLock would be a big benefit.
>>>
>>
>> That's a very good question - it's quite possible the WALWriteLock is
>> not really needed, because the processes are actually "writing" the WAL
>> directly to PMEM. So it's a bit confusing, because it's only really
>> concerned with making sure the WAL is flushed.
>>
>> And yes, multiple processes certainly can write to PMEM at the same
>> time, in fact it's a requirement to get good throughput, I believe. My
>> understanding is we need ~8 processes, at least that's what I heard
>> from people with more PMEM experience.
>
> Thanks, that's good to know.
>
>>
>> TBH I'm not convinced the code in the "simple-no-buffer" approach
>> (coming from the 0002 patch) is actually correct. Essentially, consider
>> that a backend needs to do a flush, but does not have a segment mapped.
>> So it maps it and calls pmem_drain() on it.
>>
>> But does that actually flush anything? Does it properly flush changes
>> done by other processes that may not have called pmem_drain() yet? I
>> find this somewhat suspicious and I'd bet all processes that did write
>> something have to call pmem_drain().
>

For the record, from what I learned / was told by engineers with PMEM
experience, calling pmem_drain() should properly flush changes done by
other processes. So it should be sufficient to do that in XLogFlush(),
from a single process.

My understanding is that we have about three challenges here:

(a) we still need to track how far we flushed, so this needs to be
protected by some lock anyway (although perhaps a much smaller section
of the function)

(b) pmem_drain() flushes all the changes, so it flushes even the
"future" part of the WAL after the requested LSN, which may negatively
affect performance, I guess. So I wonder if pmem_persist would be a
better fit, as it allows specifying a range that should be persisted.

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with a relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like we have for insertions, i.e. a handful of locks
allowing a limited number of concurrent inserts.

> Yeah, in terms of experiments at least it's good to find out that the
> approach of mmapping each WAL segment is not good for performance.
>

Right. The problem with small WAL segments seems to be that each mmap
causes the TLB to be thrown away, which means a lot of expensive cache
misses. As the mmap needs to be done by each backend writing WAL, this
is particularly bad with small WAL segments. The NTT patch works around
that by doing just a single mmap.

I wonder if we could pre-allocate and mmap small segments, and keep
them mapped and just rename the underlying files when recycling them.
That'd keep the regular segment files, as expected by various tools,
etc. The question is what would happen when we temporarily need more
WAL, etc.
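Just to illustrate what I mean by "keep them mapped and rename the
files", here is a minimal sketch, not from any of the patches (paths
and function names are made up), using libpmem's pmem_map_file(). The
mapping references the inode, not the file name, so a plain rename()
leaves the existing mapping (and the TLB entries) intact:

    /*
     * Hypothetical sketch, not part of any patch: pre-allocate a WAL
     * segment on a PMEM (DAX) filesystem, keep it mapped for the
     * lifetime of the process, and recycle it by renaming the file.
     */
    #include <libpmem.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WAL_SEGMENT_SIZE ((size_t) 16 * 1024 * 1024)  /* 16MB segment */

    /* map (and create, if needed) a segment file once, e.g. at startup */
    static void *
    map_wal_segment(const char *path)
    {
        size_t  mapped_len;
        int     is_pmem;
        void   *addr;

        addr = pmem_map_file(path, WAL_SEGMENT_SIZE, PMEM_FILE_CREATE,
                             0600, &mapped_len, &is_pmem);
        if (addr == NULL || !is_pmem || mapped_len != WAL_SEGMENT_SIZE)
        {
            fprintf(stderr, "could not map %s as PMEM\n", path);
            exit(1);
        }
        return addr;
    }

    /*
     * Recycle the segment under a new name - the mapping stays valid
     * because it references the inode, not the name, and the old
     * contents simply get overwritten by new WAL records.
     */
    static void
    recycle_wal_segment(const char *old_path, const char *new_path)
    {
        if (rename(old_path, new_path) != 0)
        {
            fprintf(stderr, "could not rename %s to %s\n",
                    old_path, new_path);
            exit(1);
        }
    }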
>>>
>>> ...
>>>
>>> I think the performance improvement by the NTT patch with the 16MB WAL
>>> segment, the most common WAL segment size, is very good (150437 vs.
>>> 212410 with 64 clients). But maybe evaluating writing WAL segment
>>> files on a PMEM DAX filesystem is also worthwhile, as you mentioned,
>>> if we don't do that yet.
>>>
>>
>> Well, not sure. I think the question is still open whether it's
>> actually safe to run on DAX, which does not have atomic writes of 512B
>> sectors, and I think we rely on that e.g. for pg_control. But maybe for
>> WAL that's not an issue.
>
> I think we can use the Block Translation Table (BTT) driver that
> provides atomic sector updates.
>

But we have benchmarked that, see my message from 2020/11/26, which
shows this table:

        master/btt   master/dax      ntt   simple
   -----------------------------------------------
    1         5469         7402     7977     6746
   16        48222        80869   107025    82343
   32        73974       158189   214718   158348
   64        85921       154540   225715   164248
   96       150602       221159   237008   217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at
the filesystem/kernel level, I haven't tried that.

>>
>>>> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's
>>>> a huge read-write asymmetry (the writes being way slower), and their
>>>> recommendation (in "Observation 3") is:
>>>>
>>>>     The read-write asymmetry of PMem implies the necessity of avoiding
>>>>     writes as much as possible for PMem.
>>>>
>>>> So maybe we should not be trying to use PMEM for WAL, which is pretty
>>>> write-heavy (and in most cases even write-only).
>>>
>>> I think using PMEM for WAL is cost-effective, but it leverages only the
>>> low-latency (sequential) writes, not other abilities such as
>>> fine-grained access and low-latency random writes. If we want to
>>> exploit all its abilities we might need some drastic changes to the
>>> logging protocol while considering storing data on PMEM.
>>>
>>
>> True. I think it's worth investigating whether it's sensible to use
>> PMEM for this purpose. It may turn out that replacing the DRAM WAL
>> buffers with writes directly to PMEM is not economical, and aggregating
>> data in a DRAM buffer is better :-(
>
> Yes. I think it might be interesting to do an analysis of the
> bottlenecks of the NTT patch by perf etc. If bottlenecks are moved to
> other places by removing WALWriteLock during flush, it's probably a
> good sign for further performance improvements. IIRC WALWriteLock is
> one of the main bottlenecks on OLTP workloads, although my memory might
> already be out of date.
>

I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
issue - the problem is that writing the WAL to persistent storage
itself is expensive, and we're waiting for that.

So it's not clear to me if removing the lock (and allowing multiple
processes to do pmem_drain concurrently) can actually help, considering
pmem_drain() should flush writes from other processes anyway.

But as I said, that is just my theory - I might be entirely wrong, it'd
be good to hack XLogFlush a bit and try it out.
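FWIW, here's roughly what I imagine such an experiment could look like.
This is a simplified, self-contained sketch, not actual xlog.c code
(the types, variables and locking are all made up): persist only the
requested range with pmem_persist(), and take a lock just for the
flushed-position bookkeeping, not around the flush itself.

    /*
     * Hypothetical experiment, not actual PostgreSQL code: flush WAL up
     * to the requested LSN without holding an equivalent of WALWriteLock
     * around the flush; only the shared "flushed up to" position is
     * protected by a (much smaller) lock.
     */
    #include <libpmem.h>
    #include <pthread.h>
    #include <stdint.h>

    typedef uint64_t FakeXLogRecPtr;        /* stand-in for XLogRecPtr */

    static char *wal_mapped_base;           /* PMEM-mapped WAL, set up elsewhere */
    static FakeXLogRecPtr mapped_start_lsn; /* LSN corresponding to wal_mapped_base */
    static FakeXLogRecPtr flushed_upto;     /* how far WAL is known to be durable */
    static pthread_mutex_t flush_pos_lock = PTHREAD_MUTEX_INITIALIZER;

    static void
    XLogFlushPmem(FakeXLogRecPtr upto)
    {
        FakeXLogRecPtr from;

        /* read the current flushed position (bookkeeping under the lock) */
        pthread_mutex_lock(&flush_pos_lock);
        from = flushed_upto;
        pthread_mutex_unlock(&flush_pos_lock);

        if (upto <= from)
            return;             /* already durable, nothing to do */

        /*
         * Persist just the [from, upto) range. Unlike pmem_drain(), this
         * does not force out "future" WAL written beyond 'upto', and no
         * lock is held around it, so backends can flush concurrently
         * (possibly overlapping ranges, which is harmless).
         */
        pmem_persist(wal_mapped_base + (from - mapped_start_lsn),
                     upto - from);

        /* only the shared bookkeeping needs mutual exclusion */
        pthread_mutex_lock(&flush_pos_lock);
        if (upto > flushed_upto)
            flushed_upto = upto;
        pthread_mutex_unlock(&flush_pos_lock);
    }

Whether that actually helps is exactly the open question - if
pmem_persist() hits the same concurrency limits as the writes
themselves, a handful of "flush locks" (similar to the WAL insertion
locks) might still be needed.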
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company