Thread: Write lifetime hints for NVMe

Write lifetime hints for NVMe

From
Dmitry Dolgov
Date:
Hi,

As far as I can see, support for write lifetime hints for NVMe multi-stream
was merged into the Linux kernel some time ago [1]. In theory it allows data
with a similar expected lifetime to be written together on the media, so that
it can be erased together, which minimizes garbage collection and results in
reduced write amplification as well as efficient flash utilization [2]. I
couldn't find any discussion about that on hackers, so I decided to experiment
with this feature a bit. My idea was to test a quite naive approach in which
all file descriptors related to temporary files are assigned
`RWH_WRITE_LIFE_SHORT`, and the rest `RWH_WRITE_LIFE_EXTREME`. The attached
patch is a dead simple POC without any infrastructure around it to
enable/disable hints.
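
For readers who want to try the mechanism outside of PostgreSQL, here is a minimal Python sketch of the naive approach described above. It is not the attached patch. The path-matching rule in `choose_hint` and the helper names are hypothetical, and the numeric constants are copied from `<linux/fcntl.h>` as a fallback in case the running Python's `fcntl` module does not export them.

```python
import fcntl
import struct
import tempfile

# Command numbers and hint values as defined in <linux/fcntl.h>.
# The literals are a fallback for Python builds that lack the names.
F_GET_RW_HINT = getattr(fcntl, "F_GET_RW_HINT", 1024 + 11)
F_SET_RW_HINT = getattr(fcntl, "F_SET_RW_HINT", 1024 + 12)
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_EXTREME = 5

# Hypothetical rule: treat a file as temporary if its path contains one of
# the names mentioned in this thread. A real patch would decide this where
# the file is opened, not by inspecting the path.
TEMP_MARKERS = ("pg_stat_tmp", "xlogtemp")


def choose_hint(path):
    """Return the lifetime hint this sketch would assign to `path`."""
    if any(marker in path for marker in TEMP_MARKERS):
        return RWH_WRITE_LIFE_SHORT
    return RWH_WRITE_LIFE_EXTREME


def set_write_hint(fd, hint):
    """Try to set the hint on fd's inode; return True if the kernel accepted it."""
    try:
        # The kernel expects a pointer to a 64-bit unsigned integer.
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("=Q", hint))
        return True
    except OSError:
        # Older kernels and some filesystems or sandboxes reject the command.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        accepted = set_write_hint(f.fileno(), RWH_WRITE_LIFE_SHORT)
        print("RWH_WRITE_LIFE_SHORT accepted by kernel:", accepted)
```

`F_SET_RW_HINT` is available since Linux 4.13. Whether the hint has any effect depends on the block device and driver actually making use of it, which is exactly what the benchmark below tries to find out.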

It turns out that it's possible to run benchmarks on some EC2 instance types
(e.g. c5) with a kernel version that includes this support, since they expose
a volume as an NVMe device:

```
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     vol01cdbc7ec86f17346 Amazon Elastic Block Store               1           0.00   B /   8.59  GB    512   B +  0 B   1.0
```

To get some baseline results I ran several rounds of pgbench on these quite
modest instances (dedicated, EBS-optimized) with a slightly adjusted
`max_wal_size` and an otherwise default configuration:

$ pgbench -s 200 -i
$ pgbench -T 600 -c 2 -j 2

Analyzing the `strace` output, I can see that during this test there was a
significant number of operations on pg_stat_tmp and xlogtemp, so I assume
write lifetime hints should have some effect.
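
One way to put a number on this is to count the traced system calls whose path arguments mention those names. The sketch below assumes the usual `strace` text format, in which path arguments appear inside double quotes. The function name is hypothetical, and it only sees calls that take a path (open, rename, unlink and similar), not the `write` calls on an already opened descriptor.

```python
import re
from collections import Counter

# Names mentioned in the message above.
MARKERS = ("pg_stat_tmp", "xlogtemp")

# strace prints string arguments, including paths, inside double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def count_temp_file_syscalls(lines):
    """Count trace lines that reference each marker in a quoted argument.

    A line is counted at most once per marker, even if the marker appears
    in several arguments (for example both paths of a rename).
    """
    counts = Counter()
    for line in lines:
        quoted_args = QUOTED.findall(line)
        for marker in MARKERS:
            if any(marker in arg for arg in quoted_args):
                counts[marker] += 1
    return counts


def count_in_file(trace_path):
    """Convenience wrapper for a trace saved with `strace -o <file>`."""
    with open(trace_path, errors="replace") as f:
        return count_temp_file_syscalls(f)
```

The counts only indicate how often such files are touched. They say nothing about how many bytes are written to them, which is what matters for the device.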

As a result I got a latency reduction of about 5-8% (but so far these numbers
are unstable, probably because of virtualization).

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```
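
As a quick sanity check of the two runs above, the relative change can be computed directly from the pgbench summaries. This is only arithmetic on the numbers quoted in this message. The helper names are hypothetical, and the regular expressions assume the pgbench output format shown above.

```python
import re

WITHOUT_PATCH = """\
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
"""

WITH_PATCH = """\
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
"""


def parse_pgbench(text):
    """Extract average latency (ms) and tps (including connections)."""
    latency = re.search(r"latency average = ([\d.]+) ms", text)
    tps = re.search(r"tps = ([\d.]+) \(including", text)
    return {"latency_ms": float(latency.group(1)), "tps": float(tps.group(1))}


def relative_change(before, after):
    """Change from `before` to `after`, in percent of `before`."""
    return (after - before) / before * 100.0


base = parse_pgbench(WITHOUT_PATCH)
patched = parse_pgbench(WITH_PATCH)
latency_change = relative_change(base["latency_ms"], patched["latency_ms"])
tps_change = relative_change(base["tps"], patched["tps"])

print(f"latency: {latency_change:+.1f}%")
print(f"tps:     {tps_change:+.1f}%")
```

For this particular pair of runs that works out to roughly 5.7% lower latency and 6.1% higher tps, i.e. the lower end of the 5-8% range mentioned above.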

So I have a few questions:

* Does it sound interesting and worthwhile to create a proper patch?

* Maybe someone else has similar results?

* Any suggestions about what the best/worst case scenarios for using this
  kind of hint could be?


[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
[2]: https://regmedia.co.uk/2016/09/23/0_storage-intelligence-prodoverview-2015-0.pdf

Re: Write lifetime hints for NVMe

From
Tomas Vondra
Date:

On 01/27/2018 02:20 PM, Dmitry Dolgov wrote:
> ```
> # without patch
> number of transactions actually processed: 491945
> latency average = 2.439 ms
> tps = 819.906323 (including connections establishing)
> tps = 819.908755 (excluding connections establishing)
> ```
> 
> ```
> # with patch
> number of transactions actually processed: 521805
> latency average = 2.300 ms
> tps = 869.665330 (including connections establishing)
> tps = 869.668026 (excluding connections establishing)
> ```
> 

Aren't those numbers far lower than you'd expect from NVMe storage? I do
have an NVMe drive (Intel 750) in my machine, and I can do thousands of
transactions on it with two clients. Seems a bit suspicious.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Write lifetime hints for NVMe

From
Dmitry Dolgov
Date:
> On 27 January 2018 at 16:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Aren't those numbers far lower than you'd expect from NVMe storage? I do
> have an NVMe drive (Intel 750) in my machine, and I can do thousands of
> transactions on it with two clients. Seems a bit suspicious.

Maybe NVMe storage can provide much higher numbers in general, but there are
resource limitations from AWS itself. I was using c5.large, which is the
smallest possible instance of type c5, so maybe that explains the absolute
numbers. Anyway, I can recheck, just in case I missed something.


Re: Write lifetime hints for NVMe

From
Tomas Vondra
Date:
On 01/27/2018 08:06 PM, Dmitry Dolgov wrote:
>> On 27 January 2018 at 16:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Aren't those numbers far lower than you'd expect from NVMe storage? I do
>> have an NVMe drive (Intel 750) in my machine, and I can do thousands of
>> transactions on it with two clients. Seems a bit suspicious.
> 
> Maybe NVMe storage can provide much higher numbers in general, but there are
> resource limitations from AWS itself. I was using c5.large, which is the
> smallest possible instance of type c5, so maybe that explains the absolute
> numbers. Anyway, I can recheck, just in case I missed something.
> 

According to [1] the C5 instances don't have actual NVMe devices (say,
storage in a PCIe slot or connected using M.2) but EBS volumes exposed as
NVMe devices. That would certainly explain the low IOPS numbers, as
EBS has built-in throttling. I don't know how many of the NVMe features
this EBS variant supports.
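
A quick way to tell the two cases apart on a given machine is to look at the model string the controller reports, which is what `nvme list` showed as "Amazon Elastic Block Store" earlier in the thread. The sketch below assumes the common Linux sysfs layout in which each controller directory under `/sys/class/nvme` has a `model` file. The function names are hypothetical.

```python
import glob
import os


def nvme_models(sysfs_root="/sys/class/nvme"):
    """Map controller name (e.g. 'nvme0') to its reported model string."""
    models = {}
    for ctrl_dir in sorted(glob.glob(os.path.join(sysfs_root, "nvme*"))):
        try:
            with open(os.path.join(ctrl_dir, "model")) as f:
                models[os.path.basename(ctrl_dir)] = f.read().strip()
        except OSError:
            # No readable model file for this entry; skip it.
            continue
    return models


def looks_like_ebs(model):
    """True if the model string matches the one shown earlier in the thread."""
    return "Elastic Block Store" in model


if __name__ == "__main__":
    for name, model in nvme_models().items():
        kind = "EBS volume" if looks_like_ebs(model) else "other NVMe device"
        print(f"{name}: {model} ({kind})")
```

This only identifies what kind of device sits behind the NVMe interface. It does not tell whether the device does anything with the write lifetime hints.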

Amazon actually does provide instance types (f1 and i3) with real NVMe
devices. That's what I'd be testing.


I can do some testing on my system with NVMe storage, to see if there
really is any change thanks to the patch.

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Write lifetime hints for NVMe

From
Dmitry Dolgov
Date:
> On 27 January 2018 at 23:53, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Amazon actually does provide instance types (f1 and i3) with real NVMe
> devices. That's what I'd be testing.

Yes, indeed, that's a better target for testing, thanks. I'll write back when
I get some results.