From Takashi Menjo
Subject RE: [PoC] Non-volatile WAL buffer
Msg-id 000c01d5fdb5$2f9db430$8ed91c90$@hco.ntt.co.jp_1
In response to Re: [PoC] Non-volatile WAL buffer  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Dear Andres,

Thank you for your advice about the MAP_POPULATE flag.  I rebased my msync patchset onto master and added a commit that
appends that flag when mmap()-ing WAL segments.  A new v2 patchset is attached to this mail.  Note that this patchset is
NOT the non-volatile WAL buffer one.
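
For reference, the following is only a minimal sketch of the kind of mapping call the added commit makes, not the actual
patch hunk; the function name map_wal_segment and the fd/segment_size parameters are placeholder names of mine:

#define _GNU_SOURCE             /* for MAP_POPULATE on Linux */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map a WAL segment file with MAP_POPULATE so that the kernel
 * prefaults the pages at mmap() time instead of taking a major
 * fault on the first access (MAP_POPULATE is Linux-specific).
 */
static void *
map_wal_segment(int fd, size_t segment_size)
{
    void *addr = mmap(NULL, segment_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE, fd, 0);

    return (addr == MAP_FAILED) ? NULL : addr;
}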

I also measured the performance of the following three versions, varying the -c/--client and -j/--jobs options of
pgbench, for each scaling factor s = 50 or 1000.

- Before the patchset ("before")
- Patchset applied except patch 0005, that is, without MAP_POPULATE ("after (no populate)")
- Full patchset applied, with MAP_POPULATE ("after (populate)")

The results are presented in the following tables and the attached charts.  Conditions, steps, and other details are
given later in this mail.  Note that, unlike my recent measurement of the non-volatile WAL buffer [1], I used an NVMe
SSD for pg_wal to evaluate this patchset with traditional mmap-ed files, that is, without direct access (DAX) and with
page caches.


Results (s=50)
==============
         Throughput [10^3 TPS]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  -------------------------------------
( 8, 8)  30.9    28.1 (- 9.2%)   28.3 (- 8.6%)
(18,18)  61.5    46.1 (-25.0%)   47.7 (-22.3%)
(36,18)  67.0    45.9 (-31.5%)   48.4 (-27.8%)
(54,18)  68.3    47.0 (-31.3%)   49.6 (-27.5%)

         Average Latency [ms]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  --------------------------------------
( 8, 8)  0.259   0.285 (+10.0%)  0.283 (+ 9.3%)
(18,18)  0.293   0.391 (+33.4%)  0.377 (+28.7%)
(36,18)  0.537   0.784 (+46.0%)  0.744 (+38.5%)
(54,18)  0.790   1.149 (+45.4%)  1.090 (+38.0%)


Results (s=1000)
================
         Throughput [10^3 TPS]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  ------------------------------------
( 8, 8)  32.0    29.6 (- 7.6%)   29.1 (- 9.0%)
(18,18)  66.1    49.2 (-25.6%)   50.4 (-23.7%)
(36,18)  76.4    51.0 (-33.3%)   53.4 (-30.1%)
(54,18)  80.1    54.3 (-32.2%)   57.2 (-28.6%)

         Average Latency [ms]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  --------------------------------------
( 8, 8)  0.250   0.271 (+ 8.4%)  0.275 (+10.0%)
(18,18)  0.272   0.366 (+34.6%)  0.357 (+31.3%)
(36,18)  0.471   0.706 (+49.9%)  0.674 (+43.1%)
(54,18)  0.674   0.995 (+47.6%)  0.944 (+40.1%)


Comparing "populate" with "no populate", I'd say MAP_POPULATE made performance a little better at larger client counts.
However, comparing "after" with "before", both throughput and average latency degraded.  VTune told me that "after
(populate)" still spent more CPU time memcpy-ing WAL records into mmap-ed segments than "before" did.

I also made a microbenchmark to see the behavior of mmap and msync (a sketch of this kind of test follows the list
below).  I found that:

- A major fault occurred at mmap with MAP_POPULATE, instead of at the first access to the mmap-ed space.
- Some minor faults also occurred at mmap with MAP_POPULATE, and no additional fault occurred when I loaded from the
mmap-ed space.  But once I stored to that space, a minor fault occurred.
- When I stored to a page that had been msync-ed, a minor fault occurred.
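
The following is only a minimal sketch of that kind of microbenchmark, not my actual test program; the file path and
mapping size are placeholders of mine.  It counts minor/major faults via getrusage() around each step:

#define _GNU_SOURCE             /* for MAP_POPULATE on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Print the page faults taken since the previous call. */
static void
report(const char *label, struct rusage *prev)
{
    struct rusage now;

    getrusage(RUSAGE_SELF, &now);
    printf("%-20s minflt=%ld majflt=%ld\n", label,
           now.ru_minflt - prev->ru_minflt,
           now.ru_majflt - prev->ru_majflt);
    *prev = now;
}

int
main(void)
{
    const size_t len = 16 * 1024 * 1024;        /* placeholder size */
    int fd = open("/path/to/segment", O_RDWR);  /* placeholder path */
    struct rusage base;
    volatile char c;
    char *p;

    getrusage(RUSAGE_SELF, &base);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_POPULATE, fd, 0);
    report("mmap(POPULATE)", &base);

    c = p[0];                   /* load from the mmap-ed space */
    (void) c;
    report("first load", &base);

    p[0] = 1;                   /* first store dirties the page */
    report("first store", &base);

    msync(p, len, MS_SYNC);
    report("msync", &base);

    p[0] = 2;                   /* store again after msync */
    report("store after msync", &base);

    munmap(p, len);
    close(fd);
    return 0;
}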

So I think one of the remaining causes of the performance degradation is the minor faults taken when mmap-ed pages get
dirtied.  And as far as I can see, that is not solved by MAP_POPULATE alone.


Conditions
==========
- Use one physical server with 2 NUMA nodes (nodes 0 and 1)
  - Pin postgres (server processes) to node 0 and pgbench to node 1
  - 18 cores and 192GiB DRAM per node
- Use two NVMe SSDs; one for PGDATA, another for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
- Use the attached postgresql.conf


Steps
=====
For each (c,j) pair, I performed the following steps three times, and took the median of the three runs as the final
result shown in the tables above.

(1) Run initdb with proper -D and -X options
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Run the pg_prewarm extension on all four pgbench tables
(6) Run pgbench for 30 minutes


pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench ran the default built-in "TPC-B (sort-of)" script.


Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)


Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2 sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2 sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA x 2


Best regards,
Takashi


[1] https://www.postgresql.org/message-id/002701d5fd03$6e1d97a0$4a58c6e0$@hco.ntt.co.jp_1

-- 
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

> -----Original Message-----
> From: Andres Freund <andres@anarazel.de>
> Sent: Thursday, February 20, 2020 2:04 PM
> To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
> Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;
> pgsql-hackers@postgresql.org
> Subject: Re: [PoC] Non-volatile WAL buffer
> 
> Hi,
> 
> On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:
> > I applied my patchset that mmap()-s WAL segments as WAL buffers to
> > refs/tags/REL_12_0, and measured and analyzed its performance with
> > pgbench.  Roughly speaking, When I used *SSD and ext4* to store WAL,
> > it was "obviously worse" than the original REL_12_0.  VTune told me
> > that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
> > larger than before.
> 
> FWIW, this might largely be because of page faults. In contrast to before we wouldn't reuse the same pages
> (because they've been munmap()/mmap()ed), so the first time they're touched, we'll incur page faults.  Did you
> try mmap()ing with MAP_POPULATE? It's probably also worthwhile to try to use MAP_HUGETLB.
> 
> Still doubtful it's the right direction, but I'd rather have good numbers to back me up :)
> 
> Greetings,
> 
> Andres Freund
