Thread: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

[HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Yoshimi Ichiyanagi
Date:
Hi.

These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
for reading/writing WAL logs on persistent memory (PMEM).
PMEM is a next-generation storage medium with a number of nice features:
it is fast, byte-addressable and non-volatile.

Using pgbench, a general PostgreSQL benchmark, the postgres server
with the patches applied is about 5% faster than the original server.
And using my insert benchmark, it is up to 90% faster than the original one.
I will describe the details later.


This e-mail describes the following:
A) About PMDK
B) About the patches
C) The way of running benchmarks using the patches, and the results


A) About PMDK
PMDK provides functions that allow an application to access PMEM
directly as memory, without going through the kernel, for the purpose
of high-speed access to PMEM.
The following APIs are available through PMDK.
A-1. APIs to open a file on PMEM, to create a file on PMEM,
     and to map a file on PMEM to virtual addresses
A-2. APIs to read/write data from and to a file on PMEM


A-1. APIs to open a file on PMEM, to create a file on PMEM,
     and to map a file on PMEM to virtual addresses

PMDK provides these APIs using the DAX filesystem (DAX FS)[2] feature.

DAX FS is a PMEM-aware file system which allows direct access
to PMEM without using the kernel page caches. A file in DAX FS can
be mapped to memory using standard calls like mmap() on Linux.
Furthermore, by mapping a file on PMEM to virtual addresses (and
after any initial minor page faults that may be required to create
the mappings in the MMU), an application can access PMEM
using CPU load/store instructions instead of read/write system calls.
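
For example, a file on DAX FS can be created, mapped, and accessed like
this (a minimal sketch in C using libpmem; the path and length here are
arbitrary, not taken from the patches):

#include <libpmem.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    size_t  mapped_len;
    int     is_pmem;
    char   *addr;

    /* Create a 16 MiB file on the DAX FS and map it to virtual addresses. */
    addr = pmem_map_file("/mnt/pmem0/example", 16 * 1024 * 1024,
                         PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (addr == NULL)
    {
        perror("pmem_map_file");
        exit(1);
    }

    /* From here on, PMEM is accessed with CPU load/store instructions. */
    addr[0] = 'x';

    pmem_unmap(addr, mapped_len);
    return 0;
}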


A-2. APIs to read/write data from and to a file on PMEM

PMDK provides memcpy()-like APIs that copy data to PMEM
using single instruction, multiple data (SIMD) instructions[3] and
NT store instructions[4]. These instructions improve the performance
of copying data to PMEM. As a result, using these APIs is faster than
using read/write system calls.
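
For example, a WAL-like append could look like the following sketch
(addr is assumed to be mapped with pmem_map_file() as in A-1; the other
names are placeholders):

#include <libpmem.h>
#include <stddef.h>

static void
append_record(char *addr, size_t write_offset,
              const void *record, size_t record_len)
{
    /* Copy the record to PMEM with SIMD/NT store instructions,
     * without waiting for it to become persistent yet. */
    pmem_memcpy_nodrain(addr + write_offset, record, record_len);

    /* ... more records could be copied here ... */

    /* Wait until all preceding stores to PMEM are persistent; this is
     * roughly what the proposed wal_sync_method = pmem_drain maps to. */
    pmem_drain();
}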


[1] http://pmem.io/pmdk/
[2] https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf
[3] SIMD: a SIMD instruction operates on all loaded data in a single
    operation. If the SIMD unit loads eight values into registers at once,
    the store operation to PMEM happens for all eight values
    at the same time.
[4] NT store instructions: NT (non-temporal) store instructions bypass the
    CPU cache, so using them does not require a cache flush.


B) About the patches
Changes by the patches:
0001-Add-configure-option-for-PMDK.patch:
- Added "--with-libpmem" configure option to execute I/O with PMDK library

0002-Read-write-WAL-files-using-PMDK.patch:
- Added PMDK implementation for WAL I/O operations
- Added "pmem-drain" to the wal_sync_method parameter list
  to write logs synchronously on PMEM

0003-Walreceiver-WAL-IO-using-PMDK.patch:
- Added a PMDK implementation for the walreceiver process on standby servers



C) The way of running benchmarks using the patches, and the results
It is as follows:

Experimental setup
Server: HP ProLiant DL360 Gen9
CPU:    Xeon E5-2667 v4 (3.20GHz); 2 processors(without HT)
DRAM:   DDR4-2400; 32 GiB/processor
        (8GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor
        (8GiB/socket x 4 sockets/processor) x 2 processors
HDD:    Seagate Constellation2 2.5inch SATA 3.0, 6Gb/s 1TB 7200rpm x 1
OS:     Ubuntu 16.04, linux-4.12
DAX FS: ext4
NVML:   master@Aug 30, 2017
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node, 
      and the benchmarks to the other NUMA node.


C-1. Configuring PMEM for use as a block device
# ndctl list
# ndctl create-namespace -f -e namespace0.0 --mode=memory -M dev

C-2. Creating a file system on PMEM, and mounting it with DAX
# mkfs.ext4 /dev/pmem0
# mount -t ext4 -o dax /dev/pmem0 /mnt/pmem0

C-3. Setting PMEM_IS_PMEM_FORCE so that the WAL files are treated as
     stored on PMEM
Note: If this environment variable is not set, the postgres server
      will not start.
# export PMEM_IS_PMEM_FORCE=1

C-4. Installing PostgreSQL
Note: There are 3 important things when installing PostgreSQL:
a. Execute "./configure --with-libpmem" to link libpmem
b. Set the WAL directory on PMEM
c. Change the wal_sync_method parameter in postgresql.conf from fdatasync
   to pmem_drain

# cd /path/to/[PG_source dir]
# ./configure --with-libpmem
# make && make install
# initdb /path/to/PG_DATA -X /mnt/pmem0/path/to/[PG_WAL dir]
# cat /path/to/PG_DATA/postgresql.conf | sed -e 's/#wal_sync_method = fsync/wal_sync_method = pmem_drain/' > /path/to/PG_DATA/postgresql.conf.tmp
# mv /path/to/PG_DATA/postgresql.conf.tmp /path/to/PG_DATA/postgresql.conf
# pg_ctl start -D /path/to/PG_DATA
# createdb [DB_NAME]

C-5. Running the 2 benchmarks (1. pgbench, 2. my insert benchmark)
C-5-1. pgbench
# numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync:   tps = 43,179
wal_sync_method=pmem_drain:  tps = 45,254

C-5-2. pclient_thread: my insert benchmark
Preparation
CREATE TABLE [TABLE_NAME] (id int8, value text);
ALTER TABLE [TABLE_NAME] ALTER value SET STORAGE external;
PREPARE insert_sql (int8) AS INSERT INTO [TABLE_NAME] (id, value) values ($1, '[1K_data]');

Execution
BEGIN; EXECUTE insert_sql(%lld); COMMIT;
Note: I ran this query 5M times with 32 threads. 

# ./pclient_thread
Invalid Arguments:
Usage: ./pclient_thread [The number of threads] [The number to insert tuples] [data size(KB)]
# numactl -N 1 ./pclient_thread 32 5242880 1
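
The core of each benchmark thread is roughly the following (a sketch
using libpq, not the actual pclient_thread source; the connection
string is a placeholder):

#include <libpq-fe.h>
#include <stdio.h>

static void
run_thread(long long iterations)
{
    PGconn *conn = PQconnectdb("dbname=[DB_NAME]");
    char    sql[64];

    if (PQstatus(conn) != CONNECTION_OK)
        return;

    /* The PREPARE from the preparation step must be run once here. */
    for (long long i = 0; i < iterations; i++)
    {
        snprintf(sql, sizeof(sql),
                 "BEGIN; EXECUTE insert_sql(%lld); COMMIT;", i);
        PQclear(PQexec(conn, sql));     /* one small, WAL-heavy txn */
    }
    PQfinish(conn);
}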


The averages of running this benchmark three times are:
wal_sync_method=fdatasync:   tps =  67,780
wal_sync_method=pmem_drain:  tps = 131,962

--
Yoshimi Ichiyanagi
Attachment

Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
> Using pgbench, a general PostgreSQL benchmark, the postgres server
> with the patches applied is about 5% faster than the original server.
> And using my insert benchmark, it is up to 90% faster than the original one.
> I will describe the details later.

Interesting.  But your insert benchmark looks highly artificial... in
real life, you would not insert the same long static string 160
million times.  Or if you did, you would use COPY or INSERT .. SELECT.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
> C-5. Running the 2 benchmarks (1. pgbench, 2. my insert benchmark)
> C-5-1. pgbench
> # numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]
>
> The averages of running pgbench three times are:
> wal_sync_method=fdatasync:   tps = 43,179
> wal_sync_method=pmem_drain:  tps = 45,254

What scale factor was used for this test?

Was the only non-default configuration setting wal_sync_method?  i.e.
synchronous_commit=on?  No change to max_wal_size?

This seems like an exceedingly short test -- normally, for write
tests, I recommend the median of 3 30-minute runs.  It also seems
likely to be client-bound, because of the fact that jobs = clients/4.
Normally I use jobs = clients or at least jobs = clients/2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Yoshimi Ichiyanagi
Date:
Thank you for your reply.

<CA+TgmobUrKBWgOa8x=mbW4Cmsb=NeV8Egf+RSLp7XiCAjHdmgw@mail.gmail.com>
On Wed, 17 Jan 2018 15:29:11 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
>> Using pgbench, a general PostgreSQL benchmark, the postgres server
>> with the patches applied is about 5% faster than the original server.
>> And using my insert benchmark, it is up to 90% faster than the original one.
>> I will describe the details later.
>
>Interesting.  But your insert benchmark looks highly artificial... in
>real life, you would not insert the same long static string 160
>million times.  Or if you did, you would use COPY or INSERT .. SELECT.

I made this benchmark in order to put a very heavy WAL I/O load on PMEM.

PMEM is very fast. I ran a fio-like micro-benchmark on PMEM.
This workload performed 8KB-block synchronous sequential writes,
and the total write size was 40GB.

The micro-benchmark results were the following.
Using DAX FS (like fdatasync):            5,559 MB/sec
Using DAX FS and PMDK (like pmem_drain): 13,177 MB/sec

Using pgbench, the postgres server to which my patches were applied was
only 5% faster than the original server.
>> The averages of running pgbench three times are:
>> wal_sync_method=fdatasync:   tps = 43,179
>> wal_sync_method=pmem_drain:  tps = 45,254

While this pgbench was running, the utilization of the 8 CPU cores (on which
the postgres server was running) was about 800%, and the throughput of
WAL I/O was about 10 MB/sec. I thought that pgbench was not enough to put
a heavy WAL I/O load on PMEM. So I made and ran the WAL I/O intensive test.

Do you know any good WAL I/O intensive benchmarks? DBT2?

<CA+TgmoawGN6Z8PcLKrMrGg99hF0028sFS2a1_VQEMDKcJjQDMQ@mail.gmail.com>
On Wed, 17 Jan 2018 15:40:25 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
>> C-5. Running the 2 benchmarks (1. pgbench, 2. my insert benchmark)
>> C-5-1. pgbench
>> # numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]
>>
>> The averages of running pgbench three times are:
>> wal_sync_method=fdatasync:   tps = 43,179
>> wal_sync_method=pmem_drain:  tps = 45,254
>
>What scale factor was used for this test?
The scale factor was 200.

# numactl -N 0 pgbench -s 200 -i [DB_NAME]


>Was the only non-default configuration setting wal_sync_method?  i.e.
>synchronous_commit=on?  No change to max_wal_size?
No, I used the following parameters in postgresql.conf to prevent
checkpoints from occurring while running the tests.

# - Settings -
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = pmem_drain
full_page_writes = on
wal_compression = off

# - Checkpoints -
checkpoint_timeout = 1d
max_wal_size = 20GB
min_wal_size = 20GB

>This seems like an exceedingly short test -- normally, for write
>tests, I recommend the median of 3 30-minute runs.  It also seems
>likely to be client-bound, because of the fact that jobs = clients/4.
>Normally I use jobs = clients or at least jobs = clients/2.
>

Thank you for your kind proposal. I did that.

# numactl -N 0 pgbench -s 200 -i [DB_NAME]
# numactl -N 1 pgbench -c 32 -j 32 -T 1800 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync:   tps = 39,966
wal_sync_method=pmem_drain:  tps = 41,365


--
Yoshimi Ichiyanagi



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Fri, Jan 19, 2018 at 4:56 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
>>Was the only non-default configuration setting wal_sync_method?  i.e.
>>synchronous_commit=on?  No change to max_wal_size?
> No, I used the following parameter in postgresql.conf to prevent
> checkpoints from occurring while running the tests.

I think that you really need to include the checkpoints in the tests.
I would suggest setting max_wal_size and/or checkpoint_timeout so that
you reliably complete 2 checkpoints in a 30-minute test, and then do a
comparison on that basis.

> Do you know any good WAL I/O intensive benchmarks? DBT2?

pgbench is quite a WAL-intensive benchmark; it is much more
write-heavy than what most systems experience in real life, at least
in my experience.  Your comparison of DAX FS to DAX FS + PMDK is very
interesting, but in real life the bandwidth of DAX FS is already so
high -- and the latency so low -- that I think most real-world
workloads won't gain very much.  At least, that is my impression based
on internal testing EnterpriseDB did a few months back.  (Thanks to
Mithun and Kuntal for that work.)

That's not necessarily an argument against this patch, which by the
way I have not reviewed.  Even a 5% speedup on this kind of workload
is potentially worthwhile; everyone likes it when things go faster.
I'm just not convinced you can get very much more than that on a
realistic workload.  Of course, I might be wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Fri, Jan 19, 2018 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> That's not necessarily an argument against this patch, which by the
> way I have not reviewed.  Even a 5% speedup on this kind of workload
> is potentially worthwhile; everyone likes it when things go faster.
> I'm just not convinced you can get very much more than that on a
> realistic workload.  Of course, I might be wrong.

Oh, incidentally -- in our internal testing, we found that
wal_sync_method=open_datasync was significantly faster than
wal_sync_method=fdatasync.  You might find that open_datasync isn't
much different from pmem_drain, even though they're both faster than
fdatasync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> Oh, incidentally -- in our internal testing, we found that
> wal_sync_method=open_datasync was significantly faster than
> wal_sync_method=fdatasync.  You might find that open_datasync isn't much
> different from pmem_drain, even though they're both faster than fdatasync.

That's interesting.  How fast was open_datasync in what environment (Linux distro/kernel version, HDD or SSD etc.)?

Is it now time to change the default setting to open_datasync on Linux, at least when O_DIRECT is not used (i.e. WAL
archiving or streaming replication is used)?

[Current port/linux.h]
/*
 * Set the default wal_sync_method to fdatasync.  With recent Linux versions,
 * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
 * perform better and (b) causes outright failures on ext4 data=journal
 * filesystems, because those don't support O_DIRECT.
 */
#define PLATFORM_DEFAULT_SYNC_METHOD    SYNC_METHOD_FDATASYNC


pg_test_fsync showed open_datasync is slower on my RHEL6 VM:

----------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      4276.373 ops/sec     234 usecs/op
        fdatasync                          4895.256 ops/sec     204 usecs/op
        fsync                              4797.094 ops/sec     208 usecs/op
        fsync_writethrough                              n/a
        open_sync                          4575.661 ops/sec     219 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2243.680 ops/sec     446 usecs/op
        fdatasync                          4347.466 ops/sec     230 usecs/op
        fsync                              4337.312 ops/sec     231 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2329.700 ops/sec     429 usecs/op
----------------------------------------

Regards
Takayuki Tsunakawa


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Tue, Jan 23, 2018 at 8:07 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Robert Haas [mailto:robertmhaas@gmail.com]
>> Oh, incidentally -- in our internal testing, we found that
>> wal_sync_method=open_datasync was significantly faster than
>> wal_sync_method=fdatasync.  You might find that open_datasync isn't much
>> different from pmem_drain, even though they're both faster than fdatasync.
>
> That's interesting.  How fast was open_datasync in what environment (Linux distro/kernel version, HDD or SSD etc.)?
>
> Is it now time to change the default setting to open_datasync on Linux, at least when O_DIRECT is not used (i.e. WAL
> archiving or streaming replication is used)?

I think open_datasync will be worse on systems where fsync() is
expensive -- it forces the data out to disk immediately, even if the
data doesn't need to be flushed immediately.  That's bad, because we
wait immediately when we could have deferred the wait until later and
maybe gotten the WAL writer to do the work in the background.  But it
might be better on systems where fsync() is basically free, because
there you might as well just get it out of the way immediately and not
leave something left to be done later.
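
In code terms, the trade-off looks roughly like this (a schematic
sketch, not the actual xlog.c logic):

#include <fcntl.h>
#include <unistd.h>

/* open_datasync: every WAL write is also a flush. */
static void
wal_write_open_datasync(const char *path, const char *buf, size_t len)
{
    int     fd = open(path, O_WRONLY | O_DSYNC);

    (void) write(fd, buf, len);     /* returns only once the data is durable */
    close(fd);
}

/* fdatasync: writes hit the page cache; the flush is deferred and may
 * be absorbed by the WAL writer in the background. */
static void
wal_write_fdatasync(const char *path, const char *buf, size_t len)
{
    int     fd = open(path, O_WRONLY);

    (void) write(fd, buf, len);     /* fast */
    (void) fdatasync(fd);           /* the durability cost is paid here */
    close(fd);
}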

This is just a guess, of course.  You didn't mention what the
underlying storage for your test was?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> I think open_datasync will be worse on systems where fsync() is expensive
> -- it forces the data out to disk immediately, even if the data doesn't
> need to be flushed immediately.  That's bad, because we wait immediately
> when we could have deferred the wait until later and maybe gotten the WAL
> writer to do the work in the background.  But it might be better on systems
> where fsync() is basically free, because there you might as well just get
> it out of the way immediately and not leave something left to be done later.
> 
> This is just a guess, of course.  You didn't mention what the underlying
> storage for your test was?

Uh, your guess was correct.  My file system was ext3, where fsync() writes all dirty buffers in page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on a LVM volume with ext4 (mounted with options
noatime,nobarrier) on a PCIe flash memory.
 

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     50829.597 ops/sec      20 usecs/op
        fdatasync                         42094.381 ops/sec      24 usecs/op
        fsync                             42209.972 ops/sec      24 usecs/op
        fsync_writethrough                            n/a
        open_sync                         48669.605 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     26366.373 ops/sec      38 usecs/op
        fdatasync                         33922.725 ops/sec      29 usecs/op
        fsync                             32990.209 ops/sec      30 usecs/op
        fsync_writethrough                            n/a
        open_sync                         24326.249 ops/sec      41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux in PG 11?  I can understand the concern
that users might hit performance degradation if they are using PostgreSQL on older systems.  But it's also a waste
that many users don't notice the benefits of wal_sync_method = open_datasync on new systems.

Regards
Takayuki Tsunakawa



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>> This is just a guess, of course.  You didn't mention what the underlying
>> storage for your test was?
>
> Uh, your guess was correct.  My file system was ext3, where fsync() writes all dirty buffers in page cache.

Oh, ext3 is terrible.  I don't think you can do any meaningful
benchmark results on ext3.  Use ext4 or, if you prefer, xfs.

> As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on a LVM volume with ext4 (mounted with options
> noatime,nobarrier) on a PCIe flash memory.

So does that mean it was faster than your PMDK implementation?

> What do you think about changing the default value of wal_sync_method on Linux in PG 11?  I can understand the
> concern that users might hit performance degradation if they are using PostgreSQL on older systems.  But it's also
> a waste that many users don't notice the benefits of wal_sync_method = open_datasync on new systems.

Well, some day persistent memory may be a common enough storage
technology that such a change makes sense, but these days most people
have either SSD or spinning disks, where the change would probably be
a net negative.  It seems more like something we might think about
changing in PG 20 or PG 30.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on
> > a LVM volume with ext4 (mounted with options noatime, nobarrier) on a
> > PCIe flash memory.
> 
> So does that mean it was faster than your PMDK implementation?

The PMDK patch is not mine, but is from people in NTT Lab.  I'm very curious about the comparison of open_datasync and
PMDK,too.
 


> > What do you think about changing the default value of wal_sync_method
> on Linux in PG 11?  I can understand the concern that users might hit
> performance degredation if they are using PostgreSQL on older systems.  But
> it's also mottainai that many users don't notice the benefits of
> wal_sync_method = open_datasync on new systems.
> 
> Well, some day persistent memory may be a common enough storage technology
> that such a change makes sense, but these days most people have either SSD
> or spinning disks, where the change would probably be a net negative.  It
> seems more like something we might think about changing in PG 20 or PG 30.

No, I'm not saying we should make the persistent memory mode the default.  I'm simply asking whether it's time to make
open_datasync the default setting.  We can write a notice in the release notes for users who still use ext3 etc. on old
systems.  If there's no objection, I'll submit a patch for the next CF.

Regards
Takayuki Tsunakawa




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> No, I'm not saying we should make the persistent memory mode the default.  I'm simply asking whether it's time to
> make open_datasync the default setting.  We can write a notice in the release notes for users who still use ext3
> etc. on old systems.  If there's no objection, I'll submit a patch for the next CF.

Well, like I said, I think that will degrade performance for users of
SSDs or spinning disks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Michael Paquier
Date:
On Thu, Jan 25, 2018 at 09:30:45AM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>>> This is just a guess, of course.  You didn't mention what the underlying
>>> storage for your test was?
>>
>> Uh, your guess was correct.  My file system was ext3, where fsync() writes all dirty buffers in page cache.
>
> Oh, ext3 is terrible.  I don't think you can do any meaningful
> benchmark results on ext3.  Use ext4 or, if you prefer, xfs.

Or to put it short, the lack of granular syncs in ext3 kills
performance for some workloads. Tomas Vondra's presentation on such
matters is a really cool read by the way:
https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs
(I would have loved to see this presentation live).
--
Michael

Attachment

RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > No, I'm not saying we should make the persistent memory mode the default.
> > I'm simply asking whether it's time to make open_datasync the default
> > setting.  We can write a notice in the release notes for users who still
> > use ext3 etc. on old systems.  If there's no objection, I'll submit a patch
> > for the next CF.
> 
> Well, like I said, I think that will degrade performance for users of SSDs
> or spinning disks.


As I showed previously, regular file writes on PCIe flash, *not writes using PMDK on persistent memory*, were 20% faster
with open_datasync than with fdatasync.

In addition, regular file writes on HDD with ext4 were also 10% faster:

--------------------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      3408.905 ops/sec     293 usecs/op
        fdatasync                          3111.621 ops/sec     321 usecs/op
        fsync                              3609.940 ops/sec     277 usecs/op
        fsync_writethrough                              n/a
        open_sync                          3356.362 ops/sec     298 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      1892.157 ops/sec     528 usecs/op
        fdatasync                          3284.278 ops/sec     304 usecs/op
        fsync                              3066.655 ops/sec     326 usecs/op
        fsync_writethrough                              n/a
        open_sync                          1853.415 ops/sec     540 usecs/op
--------------------------------------------------


And you said open_datasync was significantly faster than fdatasync.  Could you show your results?  What device and
filesystem did you use?

Regards
Takayuki Tsunakawa



RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael.paquier@gmail.com]
> Or to put it short, the lack of granular syncs in ext3 kills performance
> for some workloads. Tomas Vondra's presentation on such matters are a really
> cool read by the way:
> https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs

Yeah, I saw this recently, too.  That was cool.

Regards
Takayuki Tsunakawa





Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Thu, Jan 25, 2018 at 8:32 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> As I showed previously, regular file writes on PCIe flash, *not writes using PMDK on persistent memory*, were 20%
> faster with open_datasync than with fdatasync.

If I understand correctly, those results are all just pg_test_fsync
results.  That's not reflective of what will happen when the database
is actually running.  When you use open_sync or open_datasync, you
force WAL write and WAL flush to happen simultaneously, instead of
letting the WAL flush be delayed.

> And you said open_datasync was significantly faster than fdatasync.  Could you show your results?  What device and
> filesystem did you use?

I don't have the results handy at the moment.  We found it to be
faster on a database benchmark where the WAL was stored on an NVRAM
device.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Tsunakawa, Takayuki"
Date:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> If I understand correctly, those results are all just pg_test_fsync results.
> That's not reflective of what will happen when the database is actually
> running.  When you use open_sync or open_datasync, you force WAL write and
> WAL flush to happen simultaneously, instead of letting the WAL flush be
> delayed.

Yes, that's pg_test_fsync output.  Isn't pg_test_fsync the tool to determine the value for wal_sync_method?  Is this
manual misleading?

https://www.postgresql.org/docs/devel/static/pgtestfsync.html
--------------------------------------------------
pg_test_fsync - determine fastest wal_sync_method for PostgreSQL

pg_test_fsync is intended to give you a reasonable idea of what the fastest wal_sync_method is on your specific system,
as well as supplying diagnostic information in the event of an identified I/O problem.
--------------------------------------------------


Anyway, I'll use pgbench, and submit a patch if open_datasync is better than fdatasync.  I guess the current tweak of
making fdatasync the default is a holdover from the era before ext4 and XFS became prevalent.


> I don't have the results handy at the moment.  We found it to be faster
> on a database benchmark where the WAL was stored on an NVRAM device.

Oh, NVRAM.  Interesting.  Then I'll try an open_datasync/fdatasync comparison on HDD and SSD/PCIe flash with pgbench.

Regards
Takayuki Tsunakawa



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Thu, Jan 25, 2018 at 8:54 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Yes, that's pg_test_fsync output.  Isn't pg_test_fsync the tool to determine the value for wal_sync_method?  Is this
> manual misleading?

Hmm.  I hadn't thought about it as misleading, but now that you
mention it, I'd say that it probably is.  I suspect that there should
be a disclaimer saying that the fastest WAL sync method in terms of
ops/second is not necessarily the one that will deliver the best
database performance, and mention the issues around open_sync and
open_datasync specifically.  But let's see what your testing shows;
I'm talking based on now-fairly-old experience with this and a passing
familiarity with the relevant source code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Yoshimi Ichiyanagi
Date:
<CA+TgmoZygQO3EC4mMdf-b=UuY3HZz6+-Y2w5_s9bLtH4NPw6Bg@mail.gmail.com>
On Fri, 19 Jan 2018 09:42:25 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
>
>I think that you really need to include the checkpoints in the tests.
>I would suggest setting max_wal_size and/or checkpoint_timeout so that
>you reliably complete 2 checkpoints in a 30-minute test, and then do a
>comparison on that basis.

Experimental setup:
-------------------------
Server: HP ProLiant DL360 Gen9
CPU:    Xeon E5-2667 v4 (3.20GHz); 2 processors(without HT)
DRAM:   DDR4-2400; 32 GiB/processor
        (8GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor
        (node 0: 8GiB/socket x 2 sockets/processor,
         node 1: 8GiB/socket x 6 sockets/processor)
HDD:    Seagate Constellation2 2.5inch SATA 3.0, 6Gb/s 1TB 7200rpm x 1
SATA-SSD: Crucial_CT500MX200SSD1 (SATA 3.2, SATA 6Gb/s)
OS:       Ubuntu 16.04, linux-4.12
DAX FS:   ext4
PMDK:     master@Aug 30, 2017
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node,
      and the benchmarks to the other NUMA node.
-------------------------

postgresql.conf
-------------------------
# - Settings -
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = pmem_drain/fdatasync/open_datasync
full_page_writes = on
wal_compression = off

# - Checkpoints -
checkpoint_timeout = 12min
max_wal_size = 20GB
min_wal_size = 20GB
-------------------------

Executed commands:
--------------------------------------------------------------------
# numactl -N 1 pg_ctl start -D [PG_DIR] -l [LOG_FILE]
# numactl -N 0 pgbench -s 200 -i [DB_NAME]
# numactl -N 0 pgbench -c 32 -j 32 -T 1800 -r -M prepared [DB_NAME]
--------------------------------------------------------------------

The results:
--------------------------------------------------------------------
A) Applied the patches to PG src, and compiled PG with libpmem
B) Applied the patches to PG src, and compiled PG without libpmem
C) Original PG

The averages of running pgbench three times on *PMEM* are:
A)
wal_sync_method = pmem_drain      tps = 41660.42524
wal_sync_method = open_datasync   tps = 39913.49897
wal_sync_method = fdatasync       tps = 39900.83396

C)
wal_sync_method = open_datasync   tps = 40335.50178
wal_sync_method = fdatasync       tps = 40649.57772


The averages of running pgbench three times on *SATA-SSD* are:
B)
wal_sync_method = open_datasync   tps = 7224.07146
wal_sync_method = fdatasync       tps = 7222.19177

C)
wal_sync_method = open_datasync   tps = 7258.79093
wal_sync_method = fdatasync       tps = 7263.19878
--------------------------------------------------------------------

The above results show that wal_sync_method=pmem_drain was
faster than wal_sync_method=open_datasync/fdatasync.
When pgbench ran on the SATA-SSD, wal_sync_method=fdatasync was as fast
as wal_sync_method=open_datasync.


>> Do you know any good WAL I/O intensive benchmarks? DBT2?
>
>pgbench is quite a WAL-intensive benchmark; it is much more
>write-heavy than what most systems experience in real life, at least
>in my experience.  Your comparison of DAX FS to DAX FS + PMDK is very
>interesting, but in real life the bandwidth of DAX FS is already so
>high -- and the latency so low -- that I think most real-world
>workloads won't gain very much.  At least, that is my impression based
>on internal testing EnterpriseDB did a few months back.  (Thanks to
>Mithun and Kuntal for that work.)

In the near future, many physical devices will send sensing data
(IoT devices might consume tens of gigabits of network bandwidth).
The amount of data inserted into databases will increase significantly.
I think that PMEM will be needed for use cases like IoT.


<CA+TgmobDO4qj2nMLdm2Dv5VRT8cVQjv7kftsS_P-kNpNw=TRug@mail.gmail.com>
On Thu, 25 Jan 2018 09:30:45 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
>Well, some day persistent memory may be a common enough storage
>technology that such a change makes sense, but these days most people
>have either SSD or spinning disks, where the change would probably be
>a net negative.  It seems more like something we might think about
>changing in PG 20 or PG 30.
>

Oracle and Microsoft SQL Server support PMEM [1][2].
I think it is not too early for PostgreSQL to support PMEM.

[1] http://dbheartbeat.blogspot.jp/2017/11/doag-2017-oracle-18c-dbim-oracle.htm
[2]
https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/06_PM_Summit_2018_Talpey-Final_Post-CORRECTED.pdf

-- 
Yoshimi Ichiyanagi



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Robert Haas
Date:
On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
> Oracle and Microsoft SQL Server suported PMEM [1][2].
> I think it is not too early for PostgreSQL to support PMEM.

I agree; it's good to have the option available for those who have
access to the hardware.

If you haven't added your patch to the next CommitFest, please do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Yoshimi Ichiyanagi
Date:
>On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
><ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
>> Oracle and Microsoft SQL Server suported PMEM [1][2].
>> I think it is not too early for PostgreSQL to support PMEM.
>
>I agree; it's good to have the option available for those who have
>access to the hardware.
>
>If you haven't added your patch to the next CommitFest, please do so.

Thank you for your time.

I added my patches to the CommitFest 2018-3.
https://commitfest.postgresql.org/17/1485/

Oh, by the way, we submitted this proposal (Introducing PMDK into
PostgreSQL) to PGCon 2018.
If our proposal is accepted and you have time, please come to
our presentation.

-- 
Yoshimi Ichiyanagi
Mailto : ichiyanagi.yoshimi@lab.ntt.co.jp



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Andres Freund
Date:
On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:
> I added my patches to the CommitFest 2018-3.
> https://commitfest.postgresql.org/17/1485/

Unfortunately this is the last CF for the v11 development cycle. This is
a major project submitted late for v11, there's been no code level
review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
to move this to the next CF?

Greetings,

Andres Freund


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Heikki Linnakangas
Date:
On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
> Hi.
> 
> These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
> for reading/writing WAL logs on persistent memory (PMEM).
> PMEM is a next-generation storage medium with a number of nice features:
> it is fast, byte-addressable and non-volatile.

Interesting. How does this compare with using good old mmap()? I think 
just doing that would allow eliminating much of the complexity around 
managing the shared_buffers. And if the OS is smart about persistent 
memory (I don't know what the state of the art on that is), presumably 
msync() and fsync() on a file that lives in persistent memory are 
lightning fast.

- Heikki


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Yoshimi Ichiyanagi
Date:
<20180301103641.tudam4mavba3god7@alap3.anarazel.de>
On Thu, 1 Mar 2018 02:36:41 -0800, Andres Freund <andres@anarazel.de> wrote:

>On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:
>> I added my patches to the CommitFest 2018-3.
>> https://commitfest.postgresql.org/17/1485/
>
>Unfortunately this is the last CF for the v11 development cycle. This is
>a major project submitted late for v11, there's been no code level
>review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
>to move this to the next CF?

Understood. I changed the status to "moved to next CF".

-- 
Yoshimi Ichiyanagi
NTT laboratories



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Heikki Linnakangas
Date:
On 01/03/18 12:40, Heikki Linnakangas wrote:
> On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
>> These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
>> for reading/writing WAL logs on persistent memory (PMEM).
>> PMEM is a next-generation storage medium with a number of nice features:
>> it is fast, byte-addressable and non-volatile.
> 
> Interesting. How does this compare with using good old mmap()? I think
> just doing that would allow eliminating much of the complexity around
> managing the shared_buffers. And if the OS is smart about persistent
> memory (I don't know what the state of the art on that is), presumably
> msync() and fsync() on a file that lives in persistent memory are 
> lightning fast.

I briefly looked at the docs at pmem.io. pmem_map_file() uses mmap() 
under the hood, but it does some extra checks to test if the file is on 
a persistent memory device, and makes a note of it.

I think the way forward with this patch would be to map WAL segments 
with plain old mmap(), and use msync(). If that's faster than the status 
quo, great. If not, it would still be a good stepping stone for actually 
using PMDK. If nothing else, it would provide a way to test most of the 
code paths, without actually having a persistent memory device, or 
libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest 
doing exactly that: use libpmem to map a file to memory, and check if it 
lives on persistent memory using libpmem's pmem_is_pmem() function. If 
it returns yes, use pmem_drain(); if it returns no, fall back to using 
msync().
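
In code, that fallback is something like this (a sketch; is_pmem is 
the flag returned by the pmem_map_file() call that mapped the file, and 
pmem_persist() combines the CPU cache flush and drain steps):

#include <libpmem.h>
#include <stddef.h>

static void
persist_range(void *addr, size_t len, int is_pmem)
{
    if (is_pmem)
        pmem_persist(addr, len);        /* CPU flush + pmem_drain() */
    else
        (void) pmem_msync(addr, len);   /* msync()-based fallback */
}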

- Heikki


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Yoshimi Ichiyanagi
Date:
I'm sorry for the delay in replying to your mail.

<91411837-8c65-bf7d-7ca3-d69bdcb4968a@iki.fi>
On Thu, 1 Mar 2018 18:40:05 +0800, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>Interesting. How does this compare with using good old mmap()?

libpmem's pmem_map_file() supports 2M/1G (huge page sized) alignment,
which can reduce the number of page faults. 
In addition, libpmem's pmem_memcpy_nodrain() is a function
to copy data using single instruction, multiple data (SIMD) instructions
and NT store instructions (MOVNT).
As a result, using these APIs is faster than using plain old mmap()/memcpy().

Please see the PGCon2018 presentation[1] for the details.

[1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf


<83eafbfd-d9c5-6623-2423-7cab1be3888c@iki.fi>
On Fri, 20 Jul 2018 23:18:05 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>I think the way forward with this patch would be to map WAL segments 
>with plain old mmap(), and use msync(). If that's faster than the status 
>quo, great. If not, it would still be a good stepping stone for actually 
>using PMDK. 

I think so too.

I wrote this patch to replace read/write syscalls with libpmem's
API only. I believe that PMDK can make the current PostgreSQL faster.


> If nothing else, it would provide a way to test most of the 
>code paths, without actually having a persistent memory device, or 
>libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest 
>doing exactly that: use libpmem to map a file to memory, and check if it 
>lives on persistent memory using libpmem's pmem_is_pmem() function. If 
>it returns yes, use pmem_drain(); if it returns no, fall back to using 
>msync().

When PMEM_IS_PMEM_FORCE (the environment variable[2]) is set to 1,
pmem_is_pmem() returns yes.

Linux 4.15 and later support the MAP_SYNC and MAP_SHARED_VALIDATE
mmap() flags to check whether the mapped file is stored on PMEM.
An application that uses both flags in its mmap() call can be sure
that MAP_SYNC is actually supported by both the kernel and
the filesystem that the mapped file is stored on[3].
But pmem_is_pmem() doesn't support this mechanism for now.
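
For reference, that mmap() protocol looks like this (a sketch; it
assumes a kernel and libc whose headers define MAP_SYNC and
MAP_SHARED_VALIDATE):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

void *
map_wal_segment(const char *path, size_t len)
{
    int     fd = open(path, O_RDWR);
    void   *addr;

    /* MAP_SHARED_VALIDATE makes the kernel reject flags it does not
     * know, so success guarantees that MAP_SYNC is really in effect. */
    addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED && errno == EOPNOTSUPP)
    {
        /* Not a DAX mapping: fall back to a regular shared mapping
         * and rely on msync() for durability. */
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    close(fd);
    return addr;
}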

[2] http://pmem.io/pmdk/manpages/linux/v1.4/libpmem/libpmem.7.html
[3] https://lwn.net/Articles/758594/ 

--
Yoshimi Ichiyanagi
NTT Software Innovation Center
e-mail : ichiyanagi.yoshimi@lab.ntt.co.jp



Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Michael Paquier
Date:
On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> libpmem's pmem_map_file() supports 2M/1G (huge page sized) alignment,
> which can reduce the number of page faults.
> In addition, libpmem's pmem_memcpy_nodrain() is a function
> to copy data using single instruction, multiple data (SIMD) instructions
> and NT store instructions (MOVNT).
> As a result, using these APIs is faster than using plain old mmap()/memcpy().
>
> Please see the PGCon2018 presentation[1] for the details.
>
> [1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into it.  Could you provide fresher
performance numbers?  I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.
--
Michael

Attachment

Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Dmitry Dolgov
Date:
> On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> > libpmem's pmem_map_file() supports 2M/1G (huge page sized) alignment,
> > which can reduce the number of page faults.
> > In addition, libpmem's pmem_memcpy_nodrain() is a function
> > to copy data using single instruction, multiple data (SIMD) instructions
> > and NT store instructions (MOVNT).
> > As a result, using these APIs is faster than using plain old mmap()/memcpy().
> >
> > Please see the PGCon2018 presentation[1] for the details.
> >
> > [1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
>
> So you say that this represents a 3% gain based on the presentation?
> That may be interesting to dig into it.  Could you provide fresher
> performance numbers?  I am moving this patch to the next CF 2018-10 for
> now, waiting for input from the author.

Unfortunately, the patch has some conflicts now, so probably not only fresher
performance numbers are necessary, but also a rebased version.


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Dmitry Dolgov
Date:
> On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> > > libpmem's pmem_map_file() supports 2M/1G (huge page sized) alignment,
> > > which can reduce the number of page faults.
> > > In addition, libpmem's pmem_memcpy_nodrain() is a function
> > > to copy data using single instruction, multiple data (SIMD) instructions
> > > and NT store instructions (MOVNT).
> > > As a result, using these APIs is faster than using plain old mmap()/memcpy().
> > >
> > > Please see the PGCon2018 presentation[1] for the details.
> > >
> > > [1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
> >
> > So you say that this represents a 3% gain based on the presentation?
> > That may be interesting to dig into it.  Could you provide fresher
> > performance numbers?  I am moving this patch to the next CF 2018-10 for
> > now, waiting for input from the author.
>
> Unfortunately, the patch has some conflicts now, so probably not only fresher
> performance numbers are necessary, but also a rebased version.

I believe the idea behind this patch is quite important (thanks to CMU DG for
inspiring lectures), so I decided to put in some effort and rebase it to keep
it from rotting. At the same time I have a vague impression that the patch
itself suggests a quite narrow way of using PMDK.

> On 01/03/18 12:40, Heikki Linnakangas wrote:
> > On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
> >> These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
> >> for reading/writing WAL logs on persistent memory (PMEM).
> >> PMEM is a next-generation storage medium with a number of nice features:
> >> it is fast, byte-addressable and non-volatile.
> >
> > Interesting. How does this compare with using good old mmap()?

E.g. byte-addressability is not used here at all, and it's probably one of the
coolest properties: we could write not a whole block/page, but a small amount
of data, and flush just that using PMDK.

Attachment

Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Heikki Linnakangas
Date:
On 10/12/2018 23:37, Dmitry Dolgov wrote:
>> On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>
>>> On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:
>>>
>>> On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
>>>> libpmem's pmem_map_file() supports 2M/1G (huge page sized) alignment,
>>>> which can reduce the number of page faults.
>>>> In addition, libpmem's pmem_memcpy_nodrain() is a function
>>>> to copy data using single instruction, multiple data (SIMD) instructions
>>>> and NT store instructions (MOVNT).
>>>> As a result, using these APIs is faster than using plain old mmap()/memcpy().
>>>>
>>>> Please see the PGCon2018 presentation[1] for the details.
>>>>
>>>> [1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
>>>
>>> So you say that this represents a 3% gain based on the presentation?
>>> That may be interesting to dig into it.  Could you provide fresher
>>> performance numbers?  I am moving this patch to the next CF 2018-10 for
>>> now, waiting for input from the author.
>>
>> Unfortunately, the patch has some conflicts now, so probably not only fresher
>> performance numbers are necessary, but also a rebased version.
> 
> I believe the idea behind this patch is quite important (thanks to CMU DG for
> inspiring lectures), so I decided to put some efforts and rebase it to prevent
> from rotting. At the same time I have a vague impression that the patch itself
> suggests quite narrow way of using of PMDK.

Thanks.

To re-iterate what I said earlier in this thread, I think the next step 
here is to write a patch that modifies xlog.c to use plain old 
mmap()/msync() to memory-map the WAL files, to replace the WAL buffers. 
Let's see what the performance of that is, with or without NVM hardware. 
I think that might actually make the code simpler. There's a bunch of 
really hairy code around locking the WAL buffers, which could be made 
simpler if each backend memory-mapped the WAL segment files independently.

One thing to watch out for, is that if you read() a file, and there's an 
I/O error, you have a chance to ereport() it. If you try to read from a 
memory-mapped file, and there's an I/O error, the process is killed with 
SIGBUS. So I think we have to be careful with using memory-mapped I/O 
for reading files. But for writing WAL files, it seems like a good fit.

Once we have a reliable mmap()/msync() implementation running, it should 
be straightforward to change it to use MAP_SYNC and the special CPU 
instructions for the flushing.

- Heikki


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Andres Freund
Date:
Hi,

On 2019-01-23 18:45:42 +0200, Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step here
> is to write a patch that modifies xlog.c to use plain old mmap()/msync() to
> memory-map the WAL files, to replace the WAL buffers. Let's see what the
> performance of that is, with or without NVM hardware. I think that might
> actually make the code simpler. There's a bunch of really hairy code around
> locking the WAL buffers, which could be made simpler if each backend
> memory-mapped the WAL segment files independently.
> 
> One thing to watch out for, is that if you read() a file, and there's an I/O
> error, you have a chance to ereport() it. If you try to read from a
> memory-mapped file, and there's an I/O error, the process is killed with
> SIGBUS. So I think we have to be careful with using memory-mapped I/O for
> reading files. But for writing WAL files, it seems like a good fit.
> 
> Once we have a reliable mmap()/msync() implementation running, it should be
> straightforward to change it to use MAP_SYNC and the special CPU
> instructions for the flushing.

FWIW, I don't think we should go there as the sole implementation. I'm
fairly convinced that we're going to need to go to direct-IO in more
cases here, and that'll not work well with mmap.  I think this'd be a
worthwhile experiment, but I'm doubtful it'd end up simplifying our
code.

Greetings,

Andres Freund


RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Takashi Menjo"
Date:
Hello,


On behalf of Yoshimi, I rebased the patchset onto the latest master
(e3565fd6).  Please see the attachment.  It also includes an additional
bug fix (in patch 0002) about a temporary filename.

Note that PMDK 1.4.2+ supports the MAP_SYNC and MAP_SHARED_VALIDATE flags,
so please use a new version of PMDK when you test.  The latest version is 1.5.


Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step 
> here is to write a patch that modifies xlog.c to use plain old 
> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.

Sorry, but my new patchset still uses PMDK, because PMDK is supported on
Linux _and Windows_, and I think someone may want to test this patchset on
Windows...


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>


Attachment

Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Peter Eisentraut
Date:
On 25/01/2019 09:52, Takashi Menjo wrote:
> Heikki Linnakangas wrote:
>> To re-iterate what I said earlier in this thread, I think the next step 
>> here is to write a patch that modifies xlog.c to use plain old 
>> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.
> Sorry, but my new patchset still uses PMDK, because PMDK is supported on
> Linux _and Windows_, and I think someone may want to test this patchset on
> Windows...

When you manage the WAL (or perhaps in the future relation files)
through PMDK, is there still a file system view of it somewhere, for
browsing, debugging, and for monitoring tools?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Takashi Menjo"
Date:
Hi,

Peter Eisentraut wrote:
> When you manage the WAL (or perhaps in the future relation files)
> through PMDK, is there still a file system view of it somewhere, for
> browsing, debugging, and for monitoring tools?

First, I assume that our patchset is used with a filesystem that supports
the direct access (DAX) feature, and I test it with ext4 on Linux.  You can
cd into the pg_wal directory created by initdb -X pg_wal on such a
filesystem, and ls the WAL segment files managed by PMDK at runtime.

As for PostgreSQL-specific tools, perhaps yes, but I have not tested them
yet.  At least, pg_waldump seems to work as before.

Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>





RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Takashi Menjo"
Date:
Hi,

Sorry, but I found that the patchset v2 had a bug in managing the WAL segment
file offset.  I fixed it and updated the patchset to v3 (attached).

Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>


Attachment

Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Peter Eisentraut
Date:
On 30/01/2019 07:16, Takashi Menjo wrote:
> Sorry, but I found that the patchset v2 had a bug in managing the WAL segment
> file offset.  I fixed it and updated the patchset to v3 (attached).

I'm concerned with how this would affect the future maintenance of this
code.  You are introducing a whole separate code path for PMDK beside
the normal file path (and it doesn't seem very well separated either).
Now everyone who wants to do some surgery in the WAL code needs to take
that into account.  And everyone who wants to do performance work in the
WAL code needs to check that the PMDK path doesn't regress.  AFAICT,
this hardware isn't very popular at the moment, so it would be very hard
to peer review any work in this area.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
"Takashi Menjo"
Date:
Peter Eisentraut wrote:
> I'm concerned with how this would affect the future maintenance of this
> code.  You are introducing a whole separate code path for PMDK beside
> the normal file path (and it doesn't seem very well separated either).
> Now everyone who wants to do some surgery in the WAL code needs to take
> that into account.  And everyone who wants to do performance work in the
> WAL code needs to check that the PMDK path doesn't regress.  AFAICT,
> this hardware isn't very popular at the moment, so it would be very hard
> to peer review any work in this area.

Thank you for your comment.  It is reasonable that you are concerned about
maintainability.  Our patchset still lacks it.  I will think about that
when I submit the next update.  (It may take a long time, so please be
patient...)


Regards,
Takashi

-- 
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>




Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From
Takashi Menjo
Date:
Dear hackers,

I rebased my old patchset.  It would be good to compare this v4 patchset to the non-volatile WAL buffer one [1].


Regards,
Takashi

--
Takashi Menjo <takashi.menjo@gmail.com>
Attachment