Write lifetime hints for NVMe - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Write lifetime hints for NVMe
Msg-id CA+q6zcX_iz9ekV7MyO6xGH1LHHhiutmHY34n1VHNN3dLf_4C4Q@mail.gmail.com
Hi,

As far as I can see, support for write lifetime hints for NVMe multi-streaming
was merged into the Linux kernel some time ago [1]. In theory, it allows
related data to be written together on the media so that it can be erased
together, which minimizes garbage collection and results in reduced write
amplification as well as efficient flash utilization [2]. I couldn't find any
discussion of this on hackers, so I decided to experiment with the feature a
bit. My idea was to test a quite naive approach in which all file descriptors
related to temporary files are assigned `RWH_WRITE_LIFE_SHORT`, and the rest
`RWH_WRITE_LIFE_EXTREME`. The attached patch is a dead simple POC without any
infrastructure to enable or disable the hints.
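
To illustrate the naive approach (this is not the attached patch itself), here
is a minimal sketch of how such a hint could be set on a file descriptor via
`fcntl()` with `F_SET_RW_HINT`. The helper names `choose_write_hint` and
`set_write_hint` and the `is_temporary` flag are made up for this example, and
the fallback defines assume the values from linux/fcntl.h (Linux 4.13 or
later).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values as in linux/fcntl.h. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME 5
#endif

/*
 * Hypothetical helper: temporary files get the shortest lifetime hint,
 * everything else the longest one, as in the naive approach described above.
 */
static uint64_t
choose_write_hint(bool is_temporary)
{
    return is_temporary ? RWH_WRITE_LIFE_SHORT : RWH_WRITE_LIFE_EXTREME;
}

/*
 * Hypothetical helper: attach the hint to an already opened descriptor.
 * Returns 0 on success and -1 on failure, with errno set by fcntl().
 */
static int
set_write_hint(int fd, bool is_temporary)
{
    /* F_SET_RW_HINT expects a pointer to a 64-bit value. */
    uint64_t hint = choose_write_hint(is_temporary);

    return fcntl(fd, F_SET_RW_HINT, &hint);
}
```

Since the hint is only advisory, a caller could invoke `set_write_hint(fd, true)`
right after opening a temporary file and simply ignore a failure, e.g. on
kernels or filesystems without support.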

It turns out that it's possible to run benchmarks on some EC2 instance types
(e.g. c5) with a suitable kernel version, since they expose a volume as an
NVMe device:

```
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     vol01cdbc7ec86f17346 Amazon Elastic Block Store               1           0.00   B /   8.59  GB    512   B +  0 B   1.0
```

To get some baseline results, I ran several rounds of pgbench on these quite
modest instances (dedicated, with optimized EBS), using a slightly adjusted
`max_wal_size` and an otherwise default configuration:

$ pgbench -s 200 -i
$ pgbench -T 600 -c 2 -j 2

Analyzing the `strace` output, I can see that during this test there was a
significant number of operations on pg_stat_tmp and xlogtemp, so I assume
write lifetime hints should have some effect.

As a result, I got a latency reduction of about 5-8% (but so far these numbers
are unstable, probably because of virtualization).

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```
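
For the single pair of runs shown above, the relative changes work out as
follows (plain arithmetic on the reported numbers, nothing more):

```
latency: (2.439 - 2.300) / 2.439       ~ 0.057  (about 5.7% lower)
tps:     (869.665 - 819.906) / 819.906 ~ 0.061  (about 6.1% higher)
```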

So I have a few questions:

* Does it sound interesting and worthwhile to create a proper patch?

* Maybe someone else has similar results?

* Any suggestions about what the best/worst case scenarios for using this
  kind of hint might be?


[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
[2]: https://regmedia.co.uk/2016/09/23/0_storage-intelligence-prodoverview-2015-0.pdf
