Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: [PoC] Non-volatile WAL buffer
Date
Msg-id fb1490ce-24c6-4e60-0774-a29752bc97cb@postgrespro.ru
In response to Re: [PoC] Non-volatile WAL buffer  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Thank you for your feedback.

On 19.02.2021 6:25, Tomas Vondra wrote:
> On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:
>> ...
>>
>> I have heard from several DBMS experts that the appearance of huge and
>> cheap non-volatile memory can cause a revolution in database system
>> architecture. If the whole database can fit in non-volatile memory, then we
>> do not need buffers, WAL, ...
>>
>> But although multi-terabyte NVM announcements were made by IBM several
>> years ago, I do not know of any successful DBMS prototypes with the new
>> architecture.
>>
>> I tried to understand why...
>>
> IMHO those predictions are a bit too optimistic, because they often
> assume PMEM behavior is mostly similar to DRAM, except for the extra
> persistence. But that's not quite true - throughput with PMEM is much
> lower
Actually, that is not completely true.
There are several types of NVDIMMs.
The most popular now is NVDIMM-N, which is just a combination of DRAM and flash.
Its speed is the same as that of normal DRAM, but its capacity is also only
comparable with DRAM's, so I do not think it is a promising approach.
And the speed of Intel Optane memory is definitely much lower than that of DRAM.
>> But the main advantage of PMEM, from my point of view, is that it allows
>> us to avoid write-ahead logging altogether!
> No, PMEM certainly does not allow avoiding write-ahead logging - we
> still need to handle e.g. recovery after a crash, when the data files
> are in unknown / corrupted state.

It is possible to avoid write-ahead logging if we use special algorithms
(like PMwCAS) which ensure atomicity of updates.
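
Very roughly, the idea looks like the toy sketch below (a simplified
single-descriptor illustration of the two-phase approach, not the real
PMwCAS library API; the names and descriptor layout are mine, and the
persistence fences are only indicated in comments):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct
    {
        _Atomic uint64_t *addr;     /* target word in PMEM */
        uint64_t old_val;
        uint64_t new_val;
    } mwcas_entry;

    typedef struct
    {
        int n;
        mwcas_entry entries[4];
    } mwcas_desc;

    /*
     * Phase 1: CAS a marked descriptor pointer into every target word,
     * rolling back on conflict.  Phase 2: overwrite the descriptor
     * pointers with the final values.  In the real algorithm readers
     * that hit a descriptor pointer help complete the operation, and
     * every store is followed by a cache-line write-back and fence so
     * that crash recovery can roll an in-progress descriptor forward
     * or back -- which is what makes a separate WAL unnecessary for
     * such updates.
     */
    static bool
    pmwcas(mwcas_desc *d)
    {
        uint64_t desc_ptr = (uint64_t) (uintptr_t) d | 1;  /* "in progress" */
        int i;

        for (i = 0; i < d->n; i++)
        {
            uint64_t expected = d->entries[i].old_val;

            if (!atomic_compare_exchange_strong(d->entries[i].addr,
                                                &expected, desc_ptr))
            {
                while (--i >= 0)        /* undo already installed words */
                    atomic_store(d->entries[i].addr, d->entries[i].old_val);
                return false;
            }
        }
        for (i = 0; i < d->n; i++)      /* commit the final values */
            atomic_store(d->entries[i].addr, d->entries[i].new_val);
        return true;
    }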
> The problem with removing buffer manager and just writing everything
> directly to PMEM is the worse latency/throughput (compared to DRAM).
> It's probably much more efficient to combine multiple writes into RAM
> and then do one (much slower) write to persistent storage, than pay the
> higher latency for every write.
>
> It might make sense for data sets that are larger than DRAM but can fit
> into PMEM. But that seems like fairly rare case, and even then it may be
> more efficient to redesign the schema to fit into RAM somehow (sharding,
> partitioning, ...).

Certainly, avoiding buffering will make sense only if the speed of accessing
PMEM is comparable with that of DRAM.
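
To illustrate the tradeoff Tomas describes, here is a minimal sketch
(assuming an x86 CPU with CLWB support and a DAX-mapped persistent region
behind pmem_dst; the helper names are mine, not from any patch in this
thread):

    /* Compile with -mclwb on x86. */
    #include <immintrin.h>
    #include <stddef.h>
    #include <string.h>

    #define CACHELINE 64

    /*
     * Persist one buffer directly: every cache line is explicitly
     * written back and fenced, so each call pays the full PMEM write
     * latency.
     */
    static void
    persist_direct(char *pmem_dst, const char *src, size_t len)
    {
        memcpy(pmem_dst, src, len);
        for (size_t off = 0; off < len; off += CACHELINE)
            _mm_clwb(pmem_dst + off);
        _mm_sfence();           /* order write-backs before continuing */
    }

    /*
     * Accumulate in DRAM and persist once: n small writes cost a single
     * (slow) flush to persistent storage.
     */
    static void
    persist_batched(char *pmem_dst, char *dram_buf,
                    const char **srcs, const size_t *lens, int n)
    {
        size_t total = 0;

        for (int i = 0; i < n; i++)
        {
            memcpy(dram_buf + total, srcs[i], lens[i]);
            total += lens[i];
        }
        persist_direct(pmem_dst, dram_buf, total);
    }

On the critical path the direct variant pays the higher PMEM latency for
every write, while the batched variant pays it once per flush.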
>> So I have to patch the code of this library instead of just using it:
>>
>>    git@github.com:postgrespro/bztree.git
>>
>> I have not yet tested the most interesting case: access to PMEM through PMDK.
>> And I do not have hardware for such tests.
>> But the first results also seem to be interesting: PMwCAS is a kind of
>> lockless algorithm and it shows much better scaling on a
>> NUMA host compared with standard Postgres.
>>
>> I have done a simple parallel insertion test: multiple clients are
>> inserting data with random keys.
>> To make the competition with vanilla Postgres more honest I used an unlogged
>> table:
>>
>> create unlogged table t(pk int, payload int);
>> create index on t using bztree(pk);
>>
>> randinsert.sql:
>> insert into t (payload,pk) values
>> (generate_series(1,1000),random()*1000000000);
>>
>> pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres
>>
>> So each client is inserting one million records.
>> The target system has 160 virtual and 80 real cores with 256GB of RAM.
>> Results (TPS) are the following:
>>
>> N      nbtree    bztree
>> 1         540       455
>> 10        993      2237
>> 100      1479      5025
>>
>> So bztree is more than 3 times faster for 100 clients.
>> Just for comparison: the result for inserting into this table without an
>> index is 10k TPS.
>
> I'm not familiar with bztree, but I agree novel indexing structures are
> an interesting topic on their own. I only quickly skimmed the bztree
> paper, but it seems it might be useful even on DRAM (assuming it will
> work with replication etc.).
>
> The other "problem" with placing data files (tables, indexes) on PMEM
> and making this code PMEM-aware is that these writes generally happen
> asynchronously in the background, so the impact on transaction rate is
> fairly low. This is why all the patches in this thread try to apply PMEM
> on the WAL logging / flushing, which is on the critical path.

I want to give an update on my prototype.
Unfortunately, my attempt to use bztree with PMEM failed
because of two problems:

1. The libpmemobj/bztree libraries used are not compatible with the
Postgres architecture.
They support concurrent access, but only by multiple threads within one
process (they make heavy use of thread-local variables).
The traditional Postgres approach (initialize shared data structures in
the postmaster (shared_preload_libraries) and let forked child processes
inherit them) doesn't work for libpmemobj.
If a child doesn't open the pmem pool itself, then any access to it causes
a crash.
And if the child does open the pmem pool itself, the pool is mapped at a
different virtual memory address in each process.
But the bztree and pmwcas implementations expect the addresses to be the
same in all processes (see the sketch below the second problem).

2. There is some bug in the bztree/pmwcas implementation which causes its
own test to hang in case of multithreaded access in persistent mode. I
tried to find the reason for the problem but haven't succeeded yet: the
PMwCAS implementation is very non-trivial.
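
To make the address problem from the first point concrete, here is a
minimal standalone sketch using the real libpmemobj calls (the pool path
and layout name are made up):

    /* Build with: cc addr_demo.c -lpmemobj */
    #include <libpmemobj.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * In Postgres this open would have to be repeated in every
         * backend after fork(); a pool opened once in the postmaster
         * cannot simply be inherited.
         */
        PMEMobjpool *pop = pmemobj_open("/mnt/pmem/pool", "my_layout");

        if (pop == NULL)
            return 1;

        /* A PMEMoid is (pool uuid, offset): valid in any process. */
        PMEMoid root = pmemobj_root(pop, 1024);

        /*
         * pmemobj_direct() translates the offset using this process's
         * mapping, so the resulting pointer may differ from backend to
         * backend -- which is what breaks code that stores raw pointers
         * in persistent structures, as bztree/pmwcas do.
         */
        printf("root mapped at %p in this process\n", pmemobj_direct(root));

        pmemobj_close(pop);
        return 0;
    }

This is why PMDK itself encourages keeping PMEMoid offsets, not absolute
pointers, inside persistent data.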

So I just compared the single-threaded performance of the bztree test:
with Intel Optane it was about two times worse than with volatile memory.

I still wonder whether using bztree just as an in-memory index would be
interesting, because it scales much better than the Postgres B-Tree and
even our own PgPro in_memory extension. But certainly a volatile index has
very limited uses. Also, full support of all Postgres types in bztree
requires a lot of effort
(right now I support only equality comparison).

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



