Thread: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
[HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Yoshimi Ichiyanagi
Hi.

These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
for reading/writing WAL logs on persistent memory (PMEM). PMEM is a
next-generation storage technology with a number of nice features: it is fast,
byte-addressable, and non-volatile.

Using pgbench, a general PostgreSQL benchmark, the postgres server with the
patches applied is about 5% faster than the original server. Using my insert
benchmark, it is up to 90% faster than the original. I describe the details below.

This e-mail covers the following:
A) About PMDK
B) About the patches
C) How to run benchmarks using the patches, and the results

A) About PMDK
PMDK provides functions that allow an application to access PMEM directly as
memory, without going through the kernel, for the purpose of high-speed access
to PMEM. The following APIs are available through PMDK.

A-1. APIs to open a file on PMEM, to create a file on PMEM, and to map a file
on PMEM to virtual addresses
PMDK provides these APIs using the DAX filesystem (DAX FS)[2] feature. DAX FS
is a PMEM-aware file system which allows direct access to PMEM without using
the kernel page caches. A file in DAX FS can be mapped to memory using standard
calls like mmap() on Linux. Furthermore, by mapping the file on PMEM to virtual
addresses (and after any initial minor page faults that may be required to
create the mappings in the MMU), applications can access PMEM using CPU
load/store instructions instead of read/write system calls.

A-2. APIs to read/write data from and to a file on PMEM
PMDK provides APIs like memcpy() that copy data to PMEM using single
instruction, multiple data (SIMD) instructions[3] and NT store instructions[4].
These instructions improve the performance of copying data to PMEM. As a
result, using these APIs is faster than using read/write system calls.

[1] http://pmem.io/pmdk/
[2] https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf
[3] SIMD: an instruction that operates on all loaded data in a single
operation. If the SIMD system loads eight data elements into registers at once,
the store operation to PMEM happens to all eight values at the same time.
[4] NT store instructions: NT store instructions bypass the CPU cache, so using
these instructions does not require a flush.

B) About the patches
Changes by the patches:

0001-Add-configure-option-for-PMDK.patch:
- Added the "--with-libpmem" configure option to execute I/O with the PMDK library

0002-Read-write-WAL-files-using-PMDK.patch:
- Added a PMDK implementation for WAL I/O operations
- Added "pmem_drain" to the wal_sync_method parameter list to write logs
  synchronously on PMEM

0003-Walreceiver-WAL-IO-using-PMDK.patch:
- Added a PMDK implementation for the walreceiver of secondary server processes

C) How to run benchmarks using the patches, and the results

Experimental setup
Server: HP ProLiant DL360 Gen9
CPU: Xeon E5-2667 v4 (3.20GHz); 2 processors (without HT)
DRAM: DDR4-2400; 32 GiB/processor (8 GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor (8 GiB/socket x 4 sockets/processor) x 2 processors
HDD: Seagate Constellation2 2.5inch SATA 3.0 6Gb/s 1TB 7200rpm x 1
OS: Ubuntu 16.04, linux-4.12
DAX FS: ext4
NVML: master@Aug 30, 2017
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node, and the benchmarks to
the other NUMA node.

C-1. Configuring PMEM for use as a block device
# ndctl list
# ndctl create-namespace -f -e namespace0.0 --mode=memory -M dev

C-2. Creating a file system on PMEM, and mounting it with DAX
# mkfs.ext4 /dev/pmem0
# mount -t ext4 -o dax /dev/pmem0 /mnt/pmem0

C-3. Setting PMEM_IS_PMEM_FORCE so that the WAL files are treated as stored on PMEM
Note: If this environment variable is not set, the postgres processes will not start.
# export PMEM_IS_PMEM_FORCE=1

C-4. Installing PostgreSQL
Note: There are 3 important things when installing PostgreSQL:
a. Execute "./configure --with-libpmem" to link libpmem
b. Put the WAL directory on PMEM
c. Change the wal_sync_method parameter in postgresql.conf from fdatasync to pmem_drain

# cd /path/to/[PG_source dir]
# ./configure --with-libpmem
# make && make install
# initdb /path/to/PG_DATA -X /mnt/pmem0/path/to/[PG_WAL dir]
# cat /path/to/PG_DATA/postgresql.conf | sed -e s/#wal_sync_method\ =\ fsync/wal_sync_method\ =\ pmem_drain/ > /path/to/PG_DATA/postgresql.conf.tmp
# mv /path/to/PG_DATA/postgresql.conf.tmp /path/to/PG_DATA/postgresql.conf
# pg_ctl start -D /path/to/PG_DATA
# createdb [DB_NAME]

C-5. Running the 2 benchmarks (1. pgbench, 2. my insert benchmark)

C-5-1. pgbench
# numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync:  tps = 43,179
wal_sync_method=pmem_drain: tps = 45,254

C-5-2. pclient_thread: my insert benchmark
Preparation:
CREATE TABLE [TABLE_NAME] (id int8, value text);
ALTER TABLE [TABLE_NAME] ALTER value SET STORAGE external;
PREPARE insert_sql (int8) AS INSERT INTO %s (id, value) values ($1, '[1K_data]');

Execution:
BEGIN;
EXECUTE insert_sql(%lld);
COMMIT;
Note: I ran this query 5M times with 32 threads.

# ./pclient_thread
Invalid Arguments:
Usage: ./pclient_thread [The number of threads] [The number of tuples to insert] [data size(KB)]
# numactl -N 1 ./pclient_thread 32 5242880 1

The averages of running this benchmark three times are:
wal_sync_method=fdatasync:  tps = 67,780
wal_sync_method=pmem_drain: tps = 131,962

--
Yoshimi Ichiyanagi
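[For readers unfamiliar with the libpmem calls described in A-1/A-2 above, here is a
minimal, illustrative sketch of the write path. The file path and length are
placeholders, not taken from the patches, and error handling is omitted.]

#include <stddef.h>
#include <libpmem.h>

/*
 * Illustrative only: map a file on a DAX filesystem (A-1), copy a buffer
 * into it with libpmem's non-temporal memcpy (A-2), then drain the stores.
 * This assumes the file really is on PMEM (or PMEM_IS_PMEM_FORCE=1 is set);
 * otherwise msync()/pmem_msync() would be required instead of pmem_drain().
 */
static void
pmem_write_example(const char *path, const void *buf, size_t len)
{
    size_t  mapped_len;
    int     is_pmem;
    void   *dest;

    /* A-1: create/open the file and map it into the address space */
    dest = pmem_map_file(path, len, PMEM_FILE_CREATE, 0600,
                         &mapped_len, &is_pmem);
    if (dest == NULL)
        return;

    /* A-2: SIMD/NT-store based copy; does not drain by itself */
    pmem_memcpy_nodrain(dest, buf, len);

    /* make the stores durable (what the pmem_drain sync method relies on) */
    pmem_drain();

    pmem_unmap(dest, mapped_len);
}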
On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
> Using pgbench which is a PostgreSQL general benchmark, the postgres server
> to which the patches is applied is about 5% faster than original server.
> And using my insert benchmark, it is up to 90% faster than original one.
> I will describe these details later.

Interesting. But your insert benchmark looks highly artificial... in
real life, you would not insert the same long static string 160
million times. Or if you did, you would use COPY or INSERT .. SELECT.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
> C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
> C-5-1. pgbench
> # numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]
>
> The averages of running pgbench three times are:
> wal_sync_method=fdatasync: tps = 43,179
> wal_sync_method=pmem_drain: tps = 45,254

What scale factor was used for this test?

Was the only non-default configuration setting wal_sync_method? i.e.
synchronous_commit=on? No change to max_wal_size?

This seems like an exceedingly short test -- normally, for write
tests, I recommend the median of 3 30-minute runs. It also seems
likely to be client-bound, because of the fact that jobs = clients/4.
Normally I use jobs = clients or at least jobs = clients/2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Yoshimi Ichiyanagi
Thank you for your reply.

<CA+TgmobUrKBWgOa8x=mbW4Cmsb=NeV8Egf+RSLp7XiCAjHdmgw@mail.gmail.com>
Wed, 17 Jan 2018 15:29:11 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
>> Using pgbench which is a PostgreSQL general benchmark, the postgres server
>> to which the patches is applied is about 5% faster than original server.
>> And using my insert benchmark, it is up to 90% faster than original one.
>> I will describe these details later.
>
> Interesting. But your insert benchmark looks highly artificial... in
> real life, you would not insert the same long static string 160
> million times. Or if you did, you would use COPY or INSERT .. SELECT.

I made this benchmark in order to put a very heavy WAL I/O load on PMEM.

PMEM is very fast. I ran a micro-benchmark like fio on PMEM. The workload did
8 KB-block synchronous sequential writes, and the total write size was 40 GB.
The micro-benchmark results were the following:

Using DAX FS (like fdatasync):            5,559 MB/sec
Using DAX FS and PMDK (like pmem_drain): 13,177 MB/sec

Using pgbench, the postgres server to which my patches were applied was only
5% faster than the original server.

>> The averages of running pgbench three times are:
>> wal_sync_method=fdatasync: tps = 43,179
>> wal_sync_method=pmem_drain: tps = 45,254

While this pgbench run was in progress, the utilization of the 8 CPU cores on
which the postgres server was running was about 800%, and the throughput of
WAL I/O was about 10 MB/sec. I thought that pgbench was not enough to put a
heavy WAL I/O load on PMEM, so I made and ran the WAL I/O intensive test.
Do you know any good WAL I/O intensive benchmarks? DBT2?

<CA+TgmoawGN6Z8PcLKrMrGg99hF0028sFS2a1_VQEMDKcJjQDMQ@mail.gmail.com>
Wed, 17 Jan 2018 15:40:25 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
>> C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
>> C-5-1. pgbench
>> # numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]
>>
>> The averages of running pgbench three times are:
>> wal_sync_method=fdatasync: tps = 43,179
>> wal_sync_method=pmem_drain: tps = 45,254
>
> What scale factor was used for this test?

The scale factor was 200.
# numactl -N 0 pgbench -s 200 -i [DB_NAME]

> Was the only non-default configuration setting wal_sync_method? i.e.
> synchronous_commit=on? No change to max_wal_size?

No, I used the following parameters in postgresql.conf to prevent checkpoints
from occurring while running the tests.

# - Settings -
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = pmem_drain
full_page_writes = on
wal_compression = off

# - Checkpoints -
checkpoint_timeout = 1d
max_wal_size = 20GB
min_wal_size = 20GB

> This seems like an exceedingly short test -- normally, for write
> tests, I recommend the median of 3 30-minute runs. It also seems
> likely to be client-bound, because of the fact that jobs = clients/4.
> Normally I use jobs = clients or at least jobs = clients/2.

Thank you for your kind proposal. I did that.

# numactl -N 0 pgbench -s 200 -i [DB_NAME]
# numactl -N 1 pgbench -c 32 -j 32 -T 1800 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync:  tps = 39,966
wal_sync_method=pmem_drain: tps = 41,365

--
Yoshimi Ichiyanagi
On Fri, Jan 19, 2018 at 4:56 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
>> Was the only non-default configuration setting wal_sync_method? i.e.
>> synchronous_commit=on? No change to max_wal_size?
> No, I used the following parameter in postgresql.conf to prevent
> checkpoints from occurring while running the tests.

I think that you really need to include the checkpoints in the tests.
I would suggest setting max_wal_size and/or checkpoint_timeout so that
you reliably complete 2 checkpoints in a 30-minute test, and then do a
comparison on that basis.

> Do you know any good WAL I/O intensive benchmarks? DBT2?

pgbench is quite a WAL-intensive benchmark; it is much more
write-heavy than what most systems experience in real life, at least
in my experience. Your comparison of DAX FS to DAX FS + PMDK is very
interesting, but in real life the bandwidth of DAX FS is already so
high -- and the latency so low -- that I think most real-world
workloads won't gain very much. At least, that is my impression based
on internal testing EnterpriseDB did a few months back. (Thanks to
Mithun and Kuntal for that work.)

That's not necessarily an argument against this patch, which by the
way I have not reviewed. Even a 5% speedup on this kind of workload
is potentially worthwhile; everyone likes it when things go faster.
I'm just not convinced you can get very much more than that on a
realistic workload. Of course, I might be wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 19, 2018 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> That's not necessarily an argument against this patch, which by the
> way I have not reviewed. Even a 5% speedup on this kind of workload
> is potentially worthwhile; everyone likes it when things go faster.
> I'm just not convinced you can get very much more than that on a
> realistic workload. Of course, I might be wrong.

Oh, incidentally -- in our internal testing, we found that
wal_sync_method=open_datasync was significantly faster than
wal_sync_method=fdatasync. You might find that open_datasync isn't
much different from pmem_drain, even though they're both faster than
fdatasync.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> Oh, incidentally -- in our internal testing, we found that
> wal_sync_method=open_datasync was significantly faster than
> wal_sync_method=fdatasync. You might find that open_datasync isn't much
> different from pmem_drain, even though they're both faster than fdatasync.

That's interesting. How fast was open_datasync in what environment (Linux
distro/kernel version, HDD or SSD etc.)?

Is it now time to change the default setting to open_datasync on Linux, at
least when O_DIRECT is not used (i.e. WAL archiving or streaming replication
is used)?

[Current port/linux.h]
/*
 * Set the default wal_sync_method to fdatasync. With recent Linux versions,
 * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
 * perform better and (b) causes outright failures on ext4 data=journal
 * filesystems, because those don't support O_DIRECT.
 */
#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC

pg_test_fsync showed open_datasync is slower on my RHEL6 VM:

----------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              4276.373 ops/sec     234 usecs/op
        fdatasync                  4895.256 ops/sec     204 usecs/op
        fsync                      4797.094 ops/sec     208 usecs/op
        fsync_writethrough                      n/a
        open_sync                  4575.661 ops/sec     219 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              2243.680 ops/sec     446 usecs/op
        fdatasync                  4347.466 ops/sec     230 usecs/op
        fsync                      4337.312 ops/sec     231 usecs/op
        fsync_writethrough                      n/a
        open_sync                  2329.700 ops/sec     429 usecs/op
----------------------------------------

Regards
Takayuki Tsunakawa
On Tue, Jan 23, 2018 at 8:07 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Robert Haas [mailto:robertmhaas@gmail.com]
>> Oh, incidentally -- in our internal testing, we found that
>> wal_sync_method=open_datasync was significantly faster than
>> wal_sync_method=fdatasync. You might find that open_datasync isn't much
>> different from pmem_drain, even though they're both faster than fdatasync.
>
> That's interesting. How fast was open_datasync in what environment (Linux
> distro/kernel version, HDD or SSD etc.)?
>
> Is it now time to change the default setting to open_datasync on Linux, at
> least when O_DIRECT is not used (i.e. WAL archiving or streaming replication
> is used)?

I think open_datasync will be worse on systems where fsync() is
expensive -- it forces the data out to disk immediately, even if the
data doesn't need to be flushed immediately. That's bad, because we
wait immediately when we could have deferred the wait until later and
maybe gotten the WAL writer to do the work in the background. But it
might be better on systems where fsync() is basically free, because
there you might as well just get it out of the way immediately and not
leave something left to be done later.

This is just a guess, of course. You didn't mention what the
underlying storage for your test was?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> I think open_datasync will be worse on systems where fsync() is expensive
> -- it forces the data out to disk immediately, even if the data doesn't
> need to be flushed immediately. That's bad, because we wait immediately
> when we could have deferred the wait until later and maybe gotten the WAL
> writer to do the work in the background. But it might be better on systems
> where fsync() is basically free, because there you might as well just get
> it out of the way immediately and not leave something left to be done later.
>
> This is just a guess, of course. You didn't mention what the underlying
> storage for your test was?

Uh, your guess was correct. My file system was ext3, where fsync() writes all
dirty buffers in the page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL 7.2, on an
LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe flash
memory device.

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             50829.597 ops/sec      20 usecs/op
        fdatasync                 42094.381 ops/sec      24 usecs/op
        fsync                     42209.972 ops/sec      24 usecs/op
        fsync_writethrough                      n/a
        open_sync                 48669.605 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             26366.373 ops/sec      38 usecs/op
        fdatasync                 33922.725 ops/sec      29 usecs/op
        fsync                     32990.209 ops/sec      30 usecs/op
        fsync_writethrough                      n/a
        open_sync                 24326.249 ops/sec      41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux
in PG 11? I can understand the concern that users might hit performance
degradation if they are using PostgreSQL on older systems. But it's also
mottainai that many users don't notice the benefits of wal_sync_method =
open_datasync on new systems.

Regards
Takayuki Tsunakawa
On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>> This is just a guess, of course. You didn't mention what the underlying
>> storage for your test was?
>
> Uh, your guess was correct. My file system was ext3, where fsync() writes
> all dirty buffers in the page cache.

Oh, ext3 is terrible. I don't think you can do any meaningful
benchmark results on ext3. Use ext4 or, if you prefer, xfs.

> As you said, open_datasync was 20% faster than fdatasync on RHEL 7.2, on an
> LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe
> flash memory device.

So does that mean it was faster than your PMDK implementation?

> What do you think about changing the default value of wal_sync_method on
> Linux in PG 11? I can understand the concern that users might hit
> performance degradation if they are using PostgreSQL on older systems. But
> it's also mottainai that many users don't notice the benefits of
> wal_sync_method = open_datasync on new systems.

Well, some day persistent memory may be a common enough storage
technology that such a change makes sense, but these days most people
have either SSD or spinning disks, where the change would probably be
a net negative. It seems more like something we might think about
changing in PG 20 or PG 30.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> As you said, open_datasync was 20% faster than fdatasync on RHEL 7.2, on an
>> LVM volume with ext4 (mounted with options noatime, nobarrier) on a PCIe
>> flash memory device.
>
> So does that mean it was faster than your PMDK implementation?

The PMDK patch is not mine, but is from people at NTT Lab. I'm very curious
about the comparison of open_datasync and PMDK, too.

>> What do you think about changing the default value of wal_sync_method
>> on Linux in PG 11? I can understand the concern that users might hit
>> performance degradation if they are using PostgreSQL on older systems. But
>> it's also mottainai that many users don't notice the benefits of
>> wal_sync_method = open_datasync on new systems.
>
> Well, some day persistent memory may be a common enough storage technology
> that such a change makes sense, but these days most people have either SSD
> or spinning disks, where the change would probably be a net negative. It
> seems more like something we might think about changing in PG 20 or PG 30.

No, I'm not saying we should make the persistent memory mode the default. I'm
simply asking whether it's time to make open_datasync the default setting.
We can write a notice in the release notes for users who still use ext3 etc.
on old systems. If there's no objection, I'll submit a patch for the next CF.

Regards
Takayuki Tsunakawa
On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> No, I'm not saying we should make the persistent memory mode the default.
> I'm simply asking whether it's time to make open_datasync the default
> setting. We can write a notice in the release notes for users who still use
> ext3 etc. on old systems. If there's no objection, I'll submit a patch for
> the next CF.

Well, like I said, I think that will degrade performance for users of
SSDs or spinning disks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Michael Paquier
On Thu, Jan 25, 2018 at 09:30:45AM -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>>> This is just a guess, of course. You didn't mention what the underlying
>>> storage for your test was?
>>
>> Uh, your guess was correct. My file system was ext3, where fsync() writes
>> all dirty buffers in the page cache.
>
> Oh, ext3 is terrible. I don't think you can do any meaningful
> benchmark results on ext3. Use ext4 or, if you prefer, xfs.

Or to put it short, the lack of granular syncs in ext3 kills performance
for some workloads. Tomas Vondra's presentation on such matters is a really
cool read by the way:
https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs
(I would have loved to see this presentation live.)

--
Michael
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> No, I'm not saying we should make the persistent memory mode the default.
>> I'm simply asking whether it's time to make open_datasync the default
>> setting. We can write a notice in the release notes for users who still
>> use ext3 etc. on old systems. If there's no objection, I'll submit a patch
>> for the next CF.
>
> Well, like I said, I think that will degrade performance for users of SSDs
> or spinning disks.

As I showed previously, regular file writes on PCIe flash, *not writes using
PMDK on persistent memory*, were 20% faster with open_datasync than with
fdatasync. In addition, regular file writes on HDD with ext4 were also 10%
faster:

--------------------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              3408.905 ops/sec     293 usecs/op
        fdatasync                  3111.621 ops/sec     321 usecs/op
        fsync                      3609.940 ops/sec     277 usecs/op
        fsync_writethrough                      n/a
        open_sync                  3356.362 ops/sec     298 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              1892.157 ops/sec     528 usecs/op
        fdatasync                  3284.278 ops/sec     304 usecs/op
        fsync                      3066.655 ops/sec     326 usecs/op
        fsync_writethrough                      n/a
        open_sync                  1853.415 ops/sec     540 usecs/op
--------------------------------------------------

And you said open_datasync was significantly faster than fdatasync. Could you
show your results? What device and filesystem did you use?

Regards
Takayuki Tsunakawa
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Tsunakawa, Takayuki"
From: Michael Paquier [mailto:michael.paquier@gmail.com]
> Or to put it short, the lack of granular syncs in ext3 kills performance
> for some workloads. Tomas Vondra's presentation on such matters is a really
> cool read by the way:
> https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs

Yeah, I saw this recently, too. That was cool.

Regards
Takayuki Tsunakawa
On Thu, Jan 25, 2018 at 8:32 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> As I showed previously, regular file writes on PCIe flash, *not writes using
> PMDK on persistent memory*, were 20% faster with open_datasync than with
> fdatasync.

If I understand correctly, those results are all just pg_test_fsync
results. That's not reflective of what will happen when the database
is actually running. When you use open_sync or open_datasync, you
force WAL write and WAL flush to happen simultaneously, instead of
letting the WAL flush be delayed.

> And you said open_datasync was significantly faster than fdatasync. Could
> you show your results? What device and filesystem did you use?

I don't have the results handy at the moment. We found it to be
faster on a database benchmark where the WAL was stored on an NVRAM
device.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> If I understand correctly, those results are all just pg_test_fsync results.
> That's not reflective of what will happen when the database is actually
> running. When you use open_sync or open_datasync, you force WAL write and
> WAL flush to happen simultaneously, instead of letting the WAL flush be
> delayed.

Yes, that's pg_test_fsync output. Isn't pg_test_fsync the tool to determine
the value for wal_sync_method? Is this manual misleading?

https://www.postgresql.org/docs/devel/static/pgtestfsync.html
--------------------------------------------------
pg_test_fsync - determine fastest wal_sync_method for PostgreSQL

pg_test_fsync is intended to give you a reasonable idea of what the fastest
wal_sync_method is on your specific system, as well as supplying diagnostic
information in the event of an identified I/O problem.
--------------------------------------------------

Anyway, I'll use pgbench, and submit a patch if open_datasync is better than
fdatasync. I guess the current tweak of making fdatasync the default is a
holdover from the era before ext4 and XFS became prevalent.

> I don't have the results handy at the moment. We found it to be faster
> on a database benchmark where the WAL was stored on an NVRAM device.

Oh, NVRAM. Interesting. Then I'll try an open_datasync/fdatasync comparison
on HDD and SSD/PCIe flash with pgbench.

Regards
Takayuki Tsunakawa
On Thu, Jan 25, 2018 at 8:54 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Yes, that's pg_test_fsync output. Isn't pg_test_fsync the tool to determine
> the value for wal_sync_method? Is this manual misleading?

Hmm. I hadn't thought about it as misleading, but now that you
mention it, I'd say that it probably is. I suspect that there should
be a disclaimer saying that the fastest WAL sync method in terms of
ops/second is not necessarily the one that will deliver the best
database performance, and mention the issues around open_sync and
open_datasync specifically. But let's see what your testing shows;
I'm talking based on now-fairly-old experience with this and a passing
familiarity with the relevant source code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Yoshimi Ichiyanagi
<CA+TgmoZygQO3EC4mMdf-b=UuY3HZz6+-Y2w5_s9bLtH4NPw6Bg@mail.gmail.com>
Fri, 19 Jan 2018 09:42:25 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that you really need to include the checkpoints in the tests.
> I would suggest setting max_wal_size and/or checkpoint_timeout so that
> you reliably complete 2 checkpoints in a 30-minute test, and then do a
> comparison on that basis.

Experimental setup:
-------------------------
Server: HP ProLiant DL360 Gen9
CPU: Xeon E5-2667 v4 (3.20GHz); 2 processors (without HT)
DRAM: DDR4-2400; 32 GiB/processor (8 GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor (node 0: 8 GiB/socket x 2 sockets/processor, node 1: 8 GiB/socket x 6 sockets/processor)
HDD: Seagate Constellation2 2.5inch SATA 3.0 6Gb/s 1TB 7200rpm x 1
SATA-SSD: Crucial_CT500MX200SSD1 (SATA 3.2, SATA 6Gb/s)
OS: Ubuntu 16.04, linux-4.12
DAX FS: ext4
PMDK: master@Aug 30, 2017
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node, and the benchmarks to
the other NUMA node.
-------------------------

postgresql.conf
-------------------------
# - Settings -
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = pmem_drain/fdatasync/open_datasync
full_page_writes = on
wal_compression = off

# - Checkpoints -
checkpoint_timeout = 12min
max_wal_size = 20GB
min_wal_size = 20GB
-------------------------

Executed commands:
--------------------------------------------------------------------
# numactl -N 1 pg_ctl start -D [PG_DIR] -l [LOG_FILE]
# numactl -N 0 pgbench -s 200 -i [DB_NAME]
# numactl -N 0 pgbench -c 32 -j 32 -T 1800 -r [DB_NAME] -M prepared
--------------------------------------------------------------------

The results:
--------------------------------------------------------------------
A) Applied the patches to PG src, and compiled PG with libpmem
B) Applied the patches to PG src, and compiled PG without libpmem
C) Original PG

The averages of running pgbench three times on *PMEM* are:
A) wal_sync_method = pmem_drain     tps = 41660.42524
   wal_sync_method = open_datasync  tps = 39913.49897
   wal_sync_method = fdatasync      tps = 39900.83396
C) wal_sync_method = open_datasync  tps = 40335.50178
   wal_sync_method = fdatasync      tps = 40649.57772

The averages of running pgbench three times on *SATA-SSD* are:
B) wal_sync_method = open_datasync  tps = 7224.07146
   wal_sync_method = fdatasync      tps = 7222.19177
C) wal_sync_method = open_datasync  tps = 7258.79093
   wal_sync_method = fdatasync      tps = 7263.19878
--------------------------------------------------------------------

The above results show that wal_sync_method=pmem_drain was faster than
wal_sync_method=open_datasync/fdatasync. When pgbench ran on SATA-SSD,
wal_sync_method=fdatasync was as fast as wal_sync_method=open_datasync.

>> Do you know any good WAL I/O intensive benchmarks? DBT2?
>
> pgbench is quite a WAL-intensive benchmark; it is much more
> write-heavy than what most systems experience in real life, at least
> in my experience. Your comparison of DAX FS to DAX FS + PMDK is very
> interesting, but in real life the bandwidth of DAX FS is already so
> high -- and the latency so low -- that I think most real-world
> workloads won't gain very much. At least, that is my impression based
> on internal testing EnterpriseDB did a few months back. (Thanks to
> Mithun and Kuntal for that work.)

In the near future, many physical devices will send sensing data (IoT might
allow devices to exhaust tens of gigabits of network bandwidth). The amount of
data inserted into the DB will significantly increase. I think that PMEM will
be needed for use cases like IoT.

<CA+TgmobDO4qj2nMLdm2Dv5VRT8cVQjv7kftsS_P-kNpNw=TRug@mail.gmail.com>
Thu, 25 Jan 2018 09:30:45 -0500, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, some day persistent memory may be a common enough storage
> technology that such a change makes sense, but these days most people
> have either SSD or spinning disks, where the change would probably be
> a net negative. It seems more like something we might think about
> changing in PG 20 or PG 30.

Oracle and Microsoft SQL Server support PMEM [1][2].
I think it is not too early for PostgreSQL to support PMEM.

[1] http://dbheartbeat.blogspot.jp/2017/11/doag-2017-oracle-18c-dbim-oracle.htm
[2] https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/06_PM_Summit_2018_Talpey-Final_Post-CORRECTED.pdf

--
Yoshimi Ichiyanagi
On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
> Oracle and Microsoft SQL Server support PMEM [1][2].
> I think it is not too early for PostgreSQL to support PMEM.

I agree; it's good to have the option available for those who have
access to the hardware.

If you haven't added your patch to the next CommitFest, please do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Yoshimi Ichiyanagi
> On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
> <ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:
>> Oracle and Microsoft SQL Server support PMEM [1][2].
>> I think it is not too early for PostgreSQL to support PMEM.
>
> I agree; it's good to have the option available for those who have
> access to the hardware.
>
> If you haven't added your patch to the next CommitFest, please do so.

Thank you for your time. I added my patches to CommitFest 2018-3.
https://commitfest.postgresql.org/17/1485/

Oh, by the way, we submitted this proposal (Introducing PMDK into PostgreSQL)
to PGCon 2018. If our proposal is accepted and you have time, please listen to
our presentation.

--
Yoshimi Ichiyanagi
Mailto : ichiyanagi.yoshimi@lab.ntt.co.jp
On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:
> I added my patches to the CommitFest 2018-3.
> https://commitfest.postgresql.org/17/1485/

Unfortunately this is the last CF for the v11 development cycle. This is
a major project submitted late for v11, there's been no code level
review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
to move this to the next CF?

Greetings,

Andres Freund
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Heikki Linnakangas
On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
> Hi.
>
> These patches enable to use Persistent Memory Development Kit(PMDK)[1]
> for reading/writing WAL logs on persistent memory(PMEM).
> PMEM is next generation storage and it has a number of nice features:
> fast, byte-addressable and non-volatile.

Interesting. How does this compare with using good old mmap()? I think
just doing that would allow eliminating much of the complexity around
managing the shared_buffers. And if the OS is smart about persistent
memory (I don't know what the state of the art on that is), presumably
msync() and fsync() on a file that lives in persistent memory is
lightning fast.

- Heikki
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Yoshimi Ichiyanagi
<20180301103641.tudam4mavba3god7@alap3.anarazel.de>
Thu, 1 Mar 2018 02:36:41 -0800, Andres Freund <andres@anarazel.de> wrote:
> On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:
>> I added my patches to the CommitFest 2018-3.
>> https://commitfest.postgresql.org/17/1485/
>
> Unfortunately this is the last CF for the v11 development cycle. This is
> a major project submitted late for v11, there's been no code level
> review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
> to move this to the next CF?

I get it. I modified the status to "move to next CF".

--
Yoshimi Ichiyanagi
NTT laboratories
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Heikki Linnakangas
On 01/03/18 12:40, Heikki Linnakangas wrote:
> On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
>> These patches enable to use Persistent Memory Development Kit(PMDK)[1]
>> for reading/writing WAL logs on persistent memory(PMEM).
>> PMEM is next generation storage and it has a number of nice features:
>> fast, byte-addressable and non-volatile.
>
> Interesting. How does this compare with using good old mmap()? I think
> just doing that would allow eliminating much of the complexity around
> managing the shared_buffers. And if the OS is smart about persistent
> memory (I don't know what the state of the art on that is), presumably
> msync() and fsync() on a file that lives in persistent memory is
> lightning fast.

I briefly looked at the docs at pmem.io. pmem_map_file() uses mmap() under
the hood, but it does some extra checks to test if the file is on a
persistent memory device, and makes a note of it.

I think the way forward with this patch would be to map WAL segments with
plain old mmap(), and use msync(). If that's faster than the status quo,
great. If not, it would still be a good stepping stone for actually using
PMDK. If nothing else, it would provide a way to test most of the code
paths, without actually having a persistent memory device, or libpmem. The
examples at http://pmem.io/pmdk/libpmem/ actually suggest doing exactly
that: use libpmem to map a file to memory, and check if it lives on
persistent memory using libpmem's pmem_is_pmem() function. If it returns
yes, use pmem_drain(); if it returns false, fall back to using msync().

- Heikki
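[For reference, a minimal sketch of the fallback pattern from the pmem.io
examples mentioned above. Function and variable names are hypothetical, not
taken from the PostgreSQL patches.]

#include <stddef.h>
#include <string.h>
#include <libpmem.h>

/*
 * Copy a record into a mapped WAL segment and make it durable: use
 * pmem_drain() when the mapping is real persistent memory, otherwise fall
 * back to an msync()-based flush via pmem_msync().
 */
static int
flush_record(void *mapped_base, size_t mapped_len,
             size_t offset, const void *record, size_t rec_len)
{
    char *dest = (char *) mapped_base + offset;

    if (pmem_is_pmem(mapped_base, mapped_len))
    {
        pmem_memcpy_nodrain(dest, record, rec_len);   /* NT stores */
        pmem_drain();                                 /* wait for durability */
        return 0;
    }

    memcpy(dest, record, rec_len);
    return pmem_msync(dest, rec_len);                 /* msync() fallback */
}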
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Yoshimi Ichiyanagi
I'm sorry for the delay in replying to your mail.

<91411837-8c65-bf7d-7ca3-d69bdcb4968a@iki.fi>
Thu, 1 Mar 2018 18:40:05 +0800, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Interesting. How does this compare with using good old mmap()?

libpmem's pmem_map_file() supports 2M/1G (huge page size) alignment, which
can reduce the number of page faults. In addition, libpmem's
pmem_memcpy_nodrain() copies data using single instruction, multiple data
(SIMD) instructions and NT store instructions (MOVNT). As a result, using
these APIs is faster than using plain old mmap()/memcpy().

Please see the PGCon 2018 presentation[1] for the details.

[1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

<83eafbfd-d9c5-6623-2423-7cab1be3888c@iki.fi>
Fri, 20 Jul 2018 23:18:05 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I think the way forward with this patch would be to map WAL segments
> with plain old mmap(), and use msync(). If that's faster than the status
> quo, great. If not, it would still be a good stepping stone for actually
> using PMDK.

I think so too. I wrote this patch to replace read/write syscalls with
libpmem's API only. I believe that PMDK can make the current PostgreSQL
faster.

> If nothing else, it would provide a way to test most of the
> code paths, without actually having a persistent memory device, or
> libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest
> doing exactly that: use libpmem to map a file to memory, and check if it
> lives on persistent memory using libpmem's pmem_is_pmem() function. If
> it returns yes, use pmem_drain(); if it returns false, fall back to using
> msync().

When PMEM_IS_PMEM_FORCE (the environment variable[2]) is set to 1,
pmem_is_pmem() returns yes.

Linux 4.15 and later support the MAP_SYNC and MAP_SHARED_VALIDATE mmap()
flags to check whether the mapped file is stored on PMEM. An application that
uses both flags in its mmap() call can be sure that MAP_SYNC is actually
supported by both the kernel and the filesystem that the mapped file is
stored in[3]. But pmem_is_pmem() doesn't support this mechanism for now.

[2] http://pmem.io/pmdk/manpages/linux/v1.4/libpmem/libpmem.7.html
[3] https://lwn.net/Articles/758594/

--
Yoshimi Ichiyanagi
NTT Software Innovation Center
e-mail : ichiyanagi.yoshimi@lab.ntt.co.jp
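[For illustration, a sketch of the MAP_SYNC probe described above. The helper
name is hypothetical, and the flag values are fallbacks matching the Linux
UAPI headers for systems whose libc headers do not yet define them.]

#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03        /* Linux 4.15+ */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000                /* Linux 4.15+ */
#endif

/*
 * Try to map a file with MAP_SYNC. If the kernel or the filesystem does not
 * support it, the mmap() call fails and we fall back to a plain shared
 * mapping, remembering that msync() is then required for durability.
 */
static void *
map_wal_segment(int fd, size_t len, bool *map_sync_used)
{
    void *addr;

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr != MAP_FAILED)
    {
        *map_sync_used = true;          /* stores + CPU cache flush suffice */
        return addr;
    }

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *map_sync_used = false;             /* must call msync() to persist */
    return (addr == MAP_FAILED) ? NULL : addr;
}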
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Michael Paquier
On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> libpmem's pmem_map_file() supports 2M/1G (huge page size) alignment, which
> can reduce the number of page faults. In addition, libpmem's
> pmem_memcpy_nodrain() copies data using single instruction, multiple data
> (SIMD) instructions and NT store instructions (MOVNT). As a result, using
> these APIs is faster than using plain old mmap()/memcpy().
>
> Please see the PGCon 2018 presentation[1] for the details.
>
> [1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into. Could you provide fresher
performance numbers? I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.

--
Michael
> On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:
> > libpmem's pmem_map_file() supports 2M/1G (huge page size) alignment, which
> > can reduce the number of page faults. In addition, libpmem's
> > pmem_memcpy_nodrain() copies data using single instruction, multiple data
> > (SIMD) instructions and NT store instructions (MOVNT). As a result, using
> > these APIs is faster than using plain old mmap()/memcpy().
> >
> > Please see the PGCon 2018 presentation[1] for the details.
> >
> > [1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf
>
> So you say that this represents a 3% gain based on the presentation?
> That may be interesting to dig into. Could you provide fresher
> performance numbers? I am moving this patch to the next CF 2018-10 for
> now, waiting for input from the author.

Unfortunately, the patch has some conflicts now, so probably not only fresher
performance numbers are necessary, but also a rebased version.
> On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > So you say that this represents a 3% gain based on the presentation?
> > That may be interesting to dig into. Could you provide fresher
> > performance numbers? I am moving this patch to the next CF 2018-10 for
> > now, waiting for input from the author.
>
> Unfortunately, the patch has some conflicts now, so probably not only fresher
> performance numbers are necessary, but also a rebased version.

I believe the idea behind this patch is quite important (thanks to CMU DG for
inspiring lectures), so I decided to put in some effort and rebase it to keep
it from rotting. At the same time I have a vague impression that the patch
itself suggests quite a narrow way of using PMDK.

> On 01/03/18 12:40, Heikki Linnakangas wrote:
> > On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:
> >> These patches enable to use Persistent Memory Development Kit(PMDK)[1]
> >> for reading/writing WAL logs on persistent memory(PMEM).
> >> PMEM is next generation storage and it has a number of nice features:
> >> fast, byte-addressable and non-volatile.
> >
> > Interesting. How does this compare with using good old mmap()?

E.g. byte-addressability is not used here at all, and it's probably one of the
coolest properties: we could write not a block/page, but a small amount of
data, and flush it using PMDK.
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Heikki Linnakangas
On 10/12/2018 23:37, Dmitry Dolgov wrote:
>> On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>
>>> On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:
>>>
>>> So you say that this represents a 3% gain based on the presentation?
>>> That may be interesting to dig into. Could you provide fresher
>>> performance numbers? I am moving this patch to the next CF 2018-10 for
>>> now, waiting for input from the author.
>>
>> Unfortunately, the patch has some conflicts now, so probably not only fresher
>> performance numbers are necessary, but also a rebased version.
>
> I believe the idea behind this patch is quite important (thanks to CMU DG for
> inspiring lectures), so I decided to put in some effort and rebase it to keep
> it from rotting. At the same time I have a vague impression that the patch
> itself suggests quite a narrow way of using PMDK.

Thanks.

To re-iterate what I said earlier in this thread, I think the next step here
is to write a patch that modifies xlog.c to use plain old mmap()/msync() to
memory-map the WAL files, to replace the WAL buffers. Let's see what the
performance of that is, with or without NVM hardware.

I think that might actually make the code simpler. There's a bunch of really
hairy code around locking the WAL buffers, which could be made simpler if each
backend memory-mapped the WAL segment files independently.

One thing to watch out for, is that if you read() a file, and there's an I/O
error, you have a chance to ereport() it. If you try to read from a
memory-mapped file, and there's an I/O error, the process is killed with
SIGBUS. So I think we have to be careful with using memory-mapped I/O for
reading files. But for writing WAL files, it seems like a good fit.

Once we have a reliable mmap()/msync() implementation running, it should be
straightforward to change it to use MAP_SYNC and the special CPU instructions
for the flushing.

- Heikki
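[A rough sketch of the write path suggested above: each backend maps a WAL
segment file, copies records into the mapping, and flushes the written range
with msync() at commit. Function and variable names are hypothetical, not
taken from xlog.c.]

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy a record into a memory-mapped WAL segment and flush only the pages
 * that were touched. msync() requires a page-aligned starting address.
 */
static int
write_and_flush_record(char *mapped_seg, size_t seg_off,
                       const void *record, size_t rec_len)
{
    uintptr_t mask = (uintptr_t) sysconf(_SC_PAGESIZE) - 1;
    uintptr_t start, end;

    memcpy(mapped_seg + seg_off, record, rec_len);

    start = ((uintptr_t) (mapped_seg + seg_off)) & ~mask;
    end = (uintptr_t) (mapped_seg + seg_off + rec_len);

    return msync((void *) start, end - start, MS_SYNC);
}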
Hi,

On 2019-01-23 18:45:42 +0200, Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step here
> is to write a patch that modifies xlog.c to use plain old mmap()/msync() to
> memory-map the WAL files, to replace the WAL buffers. Let's see what the
> performance of that is, with or without NVM hardware. I think that might
> actually make the code simpler. There's a bunch of really hairy code around
> locking the WAL buffers, which could be made simpler if each backend
> memory-mapped the WAL segment files independently.
>
> One thing to watch out for, is that if you read() a file, and there's an I/O
> error, you have a chance to ereport() it. If you try to read from a
> memory-mapped file, and there's an I/O error, the process is killed with
> SIGBUS. So I think we have to be careful with using memory-mapped I/O for
> reading files. But for writing WAL files, it seems like a good fit.
>
> Once we have a reliable mmap()/msync() implementation running, it should be
> straightforward to change it to use MAP_SYNC and the special CPU
> instructions for the flushing.

FWIW, I don't think we should go there as the sole implementation. I'm
fairly convinced that we're going to need to go to direct I/O in more cases
here, and that'll not work well with mmap. I think this'd be a worthwhile
experiment, but I'm doubtful it'd end up simplifying our code.

Greetings,

Andres Freund
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Takashi Menjo"
Hello,

On behalf of Yoshimi, I rebased the patchset onto the latest master
(e3565fd6). Please see the attachment. It also includes an additional bug fix
(in patch 0002) for the temporary filename.

Note that PMDK 1.4.2+ supports the MAP_SYNC and MAP_SHARED_VALIDATE flags, so
please use a new version of PMDK when you test. The latest version is 1.5.

Heikki Linnakangas wrote:
> To re-iterate what I said earlier in this thread, I think the next step
> here is to write a patch that modifies xlog.c to use plain old
> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.

Sorry, but my new patchset still uses PMDK, because PMDK is supported on
Linux _and Windows_, and I think someone may want to test this patchset on
Windows...

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center <menjo.takashi@lab.ntt.co.jp>
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Peter Eisentraut
On 25/01/2019 09:52, Takashi Menjo wrote:
> Heikki Linnakangas wrote:
>> To re-iterate what I said earlier in this thread, I think the next step
>> here is to write a patch that modifies xlog.c to use plain old
>> mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.
> Sorry, but my new patchset still uses PMDK, because PMDK is supported on
> Linux _and Windows_, and I think someone may want to test this patchset on
> Windows...

When you manage the WAL (or perhaps in the future relation files)
through PMDK, is there still a file system view of it somewhere, for
browsing, debugging, and for monitoring tools?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Takashi Menjo"
Hi,

Peter Eisentraut wrote:
> When you manage the WAL (or perhaps in the future relation files)
> through PMDK, is there still a file system view of it somewhere, for
> browsing, debugging, and for monitoring tools?

First, I assume that our patchset is used with a filesystem that supports the
direct access (DAX) feature, and I test it with ext4 on Linux. You can cd into
the pg_wal directory created by initdb -X pg_wal on such a filesystem, and ls
the WAL segment files managed by PMDK at runtime.

For each PostgreSQL-specific tool, perhaps yes, but I have not tested them
yet. At least, pg_waldump looks to be working as before.

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center <menjo.takashi@lab.ntt.co.jp>
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Takashi Menjo"
Hi,

Sorry, but I found that the patchset v2 had a bug in managing the WAL segment
file offset. I fixed it and updated the patchset as v3 (attached).

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center <menjo.takashi@lab.ntt.co.jp>
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: Peter Eisentraut
On 30/01/2019 07:16, Takashi Menjo wrote:
> Sorry, but I found that the patchset v2 had a bug in managing the WAL
> segment file offset. I fixed it and updated the patchset as v3 (attached).

I'm concerned with how this would affect the future maintenance of this
code. You are introducing a whole separate code path for PMDK beside
the normal file path (and it doesn't seem very well separated either).
Now everyone who wants to do some surgery in the WAL code needs to take
that into account. And everyone who wants to do performance work in the
WAL code needs to check that the PMDK path doesn't regress. AFAICT,
this hardware isn't very popular at the moment, so it would be very hard
to peer review any work in this area.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory
From: "Takashi Menjo"
Peter Eisentraut wrote:
> I'm concerned with how this would affect the future maintenance of this
> code. You are introducing a whole separate code path for PMDK beside
> the normal file path (and it doesn't seem very well separated either).
> Now everyone who wants to do some surgery in the WAL code needs to take
> that into account. And everyone who wants to do performance work in the
> WAL code needs to check that the PMDK path doesn't regress. AFAICT,
> this hardware isn't very popular at the moment, so it would be very hard
> to peer review any work in this area.

Thank you for your comment. It is reasonable that you are concerned about
maintainability. Our patchset still lacks it. I will think about that when I
submit the next update. (It may take a long time, so please be patient...)

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center <menjo.takashi@lab.ntt.co.jp>
Dear hackers,
I rebased my old patchset. It would be good to compare this v4 patchset to non-volatile WAL buffer's one [1].
Regards,
Takashi
Takashi Menjo <takashi.menjo@gmail.com>