Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [PoC] Non-volatile WAL buffer |
Date | |
Msg-id | 361cf3fd-40d8-366a-a50a-778a91ff52bb@enterprisedb.com |
In response to | Re: [PoC] Non-volatile WAL buffer (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>) |
Responses | Re: [PoC] Non-volatile WAL buffer |
List | pgsql-hackers |
On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:
> ...
>
> I have heard from several DBMS experts that appearance of huge and
> cheap non-volatile memory can make a revolution in database system
> architecture. If all database can fit in non-volatile memory, then we
> do not need buffers, WAL, ...
>
> But although multi-terabyte NVM announces were made by IBM several
> years ago, I do not know about some successful DBMS prototypes with new
> architecture.
>
> I tried to understand why...
>

IMHO those predictions are a bit too optimistic, because they often
assume PMEM behavior is mostly similar to DRAM, except for the extra
persistence. But that's not quite true - throughput with PMEM is much
lower in general, peak throughput is reached with few processes (and
then drops quickly) etc. But over the last few years we were focused on
optimizing for exactly the opposite - systems with many CPU cores and
processes, because that's what maximizes DRAM throughput.

I'm not saying a revolution is not possible, but it'll probably require
quite significant rethinking of the whole architecture, and it may take
multiple PMEM generations until the performance improves enough to make
this economical. Some systems are probably more suitable for this (e.g.
Redis is doing most of the work in a single process, IIRC).

The other challenge of course is availability of the hardware - most
users run on whatever is widely available at cloud providers. And PMEM
is unlikely to get there very soon, I'd guess. Until that happens, the
pressure from these customers will be (naturally) fairly low. Perhaps
someone will develop hardware appliances for on-premise setups, as was
quite common in the past. Not sure.

> It was very interesting to me to read this thread, which is actually
> started in 2016 with "Non-volatile Memory Logging" presentation at PGCon.
> As far as I understand from Tomas result right now using PMEM for WAL
> doesn't provide some substantial increase of performance.
>

At the moment, I'd probably agree. It's quite possible the PoC patches
are missing some optimizations and the difference might be better, but
even then the performance increase seems fairly modest and limited to
certain workloads.

> But the main advantage of PMEM from my point of view is that it allows
> to avoid write-ahead logging at all!

No, PMEM certainly does not allow avoiding write-ahead logging - we
still need to handle e.g. recovery after a crash, when the data files
are in unknown / corrupted state.

Not to mention that WAL is used for physical and logical replication
(and thus HA), and so on.

> Certainly we need to change our algorithms to make it possible. Speaking
> about Postgres, we have to rewrite all indexes + heap
> and throw away buffer manager + WAL.
>

The problem with removing the buffer manager and just writing everything
directly to PMEM is the worse latency/throughput (compared to DRAM).
It's probably much more efficient to combine multiple writes into RAM
and then do one (much slower) write to persistent storage, than pay the
higher latency for every write.

It might make sense for data sets that are larger than DRAM but can fit
into PMEM. But that seems like a fairly rare case, and even then it may
be more efficient to redesign the schema to fit into RAM somehow
(sharding, partitioning, ...).

> What can be used instead of standard B-Tree?
> For example there is description of multiword-CAS approach:
>
>    http://justinlevandoski.org/papers/mwcas.pdf
>
> and BzTree implementation on top of it:
>
>    https://www.cc.gatech.edu/~jarulraj/papers/2018.bztree.vldb.pdf
>
> There is free BzTree implementation at github:
>
>    git@github.com:sfu-dis/bztree.git
>
> I tried to adopt it for Postgres. It was not so easy because:
> 1. It was written in modern C++ (-std=c++14)
> 2. It supports multithreading, but not multiprocess access
>
> So I have to patch code of this library instead of just using it:
>
>    git@github.com:postgrespro/bztree.git
>
> I have not tested yet most iterating case: access to PMEM through PMDK.
> And I do not have hardware for such tests.
> But first results are also seem to be interesting: PMwCAS is kind of
> lockless algorithm and it shows much better scaling at
> NUMA host comparing with standard Postgres.
>
> I have done simple parallel insertion test: multiple clients are
> inserting data with random keys.
> To make competition with vanilla Postgres more honest I used unlogged
> table:
>
> create unlogged table t(pk int, payload int);
> create index on t using bztree(pk);
>
> randinsert.sql:
> insert into t (payload,pk) values
> (generate_series(1,1000),random()*1000000000);
>
> pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres
>
> So each client is inserting one million records.
> The target system has 160 virtual and 80 real cores with 256GB of RAM.
> Results (TPS) are the following:
>
>   N    nbtree    bztree
>   1       540       455
>  10       993      2237
> 100      1479      5025
>
> So bztree is more than 3 times faster for 100 clients.
> Just for comparison: result for inserting in this table without index is
> 10k TPS.
>

I'm not familiar with bztree, but I agree novel indexing structures are
an interesting topic on their own. I only quickly skimmed the bztree
paper, but it seems it might be useful even on DRAM (assuming it will
work with replication etc.).

The other "problem" with placing data files (tables, indexes) on PMEM
and making this code PMEM-aware is that these writes generally happen
asynchronously in the background, so the impact on transaction rate is
fairly low. This is why all the patches in this thread try to apply PMEM
on the WAL logging / flushing, which is on the critical path.

> I am going then try to play with PMEM.
> If results will be promising, then it is possible to think about
> reimplementation of heap and WAL-less Postgres!
>
> I am sorry, that my post has no direct relation to the topic of this
> thread (Non-volatile WAL buffer).
> It seems to be that it is better to use PMEM to eliminate WAL at all
> instead of optimizing it.
> Certainly, I realize that WAL plays very important role in Postgres:
> archiving and replication are based on WAL. So even if we can live
> without WAL, it is still not clear whether we really want to live
> without it.
>
> One more idea: using multiword CAS approach requires us to make changes
> as editing sequences.
> Such editing sequence is actually ready WAL records. So implementors of
> access methods do not have to do
> double work: update data structure in memory and create correspondent
> WAL records. Moreover, PMwCAS operations are atomic:
> we can replay or revert them in case of fault. So there is no need in
> FPW (full page writes) which have very noticeable impact on WAL size and
> database performance.
>

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
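
For anyone wanting to check the "more than 3 times faster" claim against
the quoted benchmark table, here is a minimal sketch that recomputes the
bztree/nbtree ratio for each client count. The TPS numbers are copied
from the quoted table; the names `results`, `speedup` and `ratios` are
only illustrative and not part of any patch in this thread.

```python
# TPS figures copied from the quoted benchmark table (key = number of clients).
results = {
    1: {"nbtree": 540, "bztree": 455},
    10: {"nbtree": 993, "bztree": 2237},
    100: {"nbtree": 1479, "bztree": 5025},
}


def speedup(row):
    """Return bztree TPS divided by nbtree TPS for one row of the table."""
    return row["bztree"] / row["nbtree"]


# Ratio above 1.0 means bztree was faster than nbtree at that client count.
ratios = {clients: round(speedup(row), 2) for clients, row in results.items()}

for clients, ratio in ratios.items():
    print(f"{clients:>3} clients: bztree/nbtree = {ratio:.2f}")
```

With these numbers the ratio is about 0.84 for a single client (bztree
is somewhat slower there), about 2.25 for 10 clients and about 3.40 for
100 clients, so the claim holds only at the highest client count tested.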