Is NVMe RAID useless (performance-wise) with PostgreSQL? - Mailing list pgsql-general

From Dragan Milivojević
Subject Is NVMe RAID useless (performance-wise) with PostgreSQL?
Msg-id 514b55e0-342c-4ec9-b6b4-544af0415dd6@pkm-inc.com
List pgsql-general
Hi all,

While building a new PostgreSQL server, I realized that the performance with a single disk
is the same or better than with a RAID0 4-disk array.

All benchmarks were conducted using pgbench with a scaling factor of 2000.
For a typical run with pgbench -j 4 -c 512 -P 60 -r -T 300 -b tpcb-like, these are the results:
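For anyone reproducing this, a scale-2000 pgbench database can be initialized and driven along
these lines (the database name "pgbench" here is illustrative):

  pgbench -i -s 2000 pgbench        # roughly 30 GB of data at scale factor 2000
  pgbench -j 4 -c 512 -P 60 -r -T 300 -b tpcb-like pgbench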


single disk run:

latency average = 17.524 ms
latency stddev = 6.904 ms
tps = 28870


iostat:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           26.10    0.00   21.02    2.95    0.00   49.93

| Device  |   r/s    |  rMB/s | rrqm/s | %rrqm | r_await | rareq-sz |   w/s    |  wMB/s | wrqm/s | %wrqm | w_await | wareq-sz |  f/s   | f_await | aqu-sz | %util |
|---------|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:|
| nvme0n1 | 28641.27 | 255.00 |   0.00 |  0.00 |    0.16 |     9.12 | 27665.67 | 458.09 |   0.00 |  0.00 |    0.09 |    16.96 | 251.47 |    1.69 |   7.69 | 98.08 |



RAID0 4 disk, 4K chunk:

latency average = 22.269 ms
latency stddev = 10.825 ms
tps = 22742


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           23.63    0.00   19.63    1.53    0.00   55.21

| Device  |   r/s    |  rMB/s |  rrqm/s | %rrqm | r_await | rareq-sz |    w/s    |  wMB/s | wrqm/s | %wrqm | w_await | wareq-sz |  f/s   | f_await | aqu-sz | %util |
|---------|:--------:|:------:|:-------:|:-----:|:-------:|:--------:|:---------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:|
| md127   | 55359.93 | 216.25 |    0.00 |  0.00 |    0.09 |     4.00 | 105629.07 | 412.61 |   0.00 |  0.00 |    0.04 |     4.00 |   0.00 |    0.00 |   9.02 | 93.76 |
| nvme1n1 | 12763.33 |  54.03 | 1067.47 |  7.72 |    0.08 |     4.33 |  26572.07 | 103.31 |  37.33 |  0.14 |    0.05 |     3.98 | 162.53 |    1.74 |   2.67 | 99.18 |
| nvme3n1 | 12753.07 |  53.97 | 1063.87 |  7.70 |    0.08 |     4.33 |  26560.47 | 103.26 |  37.40 |  0.14 |    0.05 |     3.98 | 162.47 |    1.73 |   2.58 | 99.15 |
| nvme4n1 | 12787.27 |  54.10 | 1062.80 |  7.67 |    0.09 |     4.33 |  26492.73 | 102.99 |  35.67 |  0.13 |    0.05 |     3.98 | 162.53 |    1.69 |   2.67 | 99.07 |
| nvme5n1 | 12796.53 |  54.15 | 1065.60 |  7.69 |    0.09 |     4.33 |  26505.67 | 103.04 |  35.73 |  0.13 |    0.05 |     3.98 | 162.53 |    1.66 |   2.56 | 98.95 |

BTW, if these tables are mangled in transport or by email clients, I posted this email to
https://pastebin.com/raw/ZmsH0T5M.

A 4K chunk is obviously not optimal, but I should still be getting around a 2x uplift.
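For reference, a 4-disk RAID0 array like the ones tested here can be built with mdadm roughly as
follows (device names are illustrative; --chunk is in KB and is the parameter being varied):

  mdadm --create /dev/md127 --level=0 --raid-devices=4 --chunk=256 \
        /dev/nvme1n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1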

In the past, when tuning PostgreSQL, I tweaked various RAID parameters such as chunk size, stripe_cache_size, etc.,
but it never occurred to me to compare the performance against a single drive. Hence this email.
I'm not sure if this is expected or if there is something wrong with my setup.
Full system details are at the end of the message.

While exploring this, I went deep down the rabbit hole, running hundreds of tests and trying dozens of configurations.
The best I achieved was 31K TPS with plain RAID0 (256KB chunk), nvme poll_queues=4 and io_poll=1.
That is a measly 2% improvement compared to a single disk.
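Those polling knobs are the NVMe driver's poll_queues module parameter and the block queue's
io_poll attribute; they can be set with something like the following (device name illustrative; the
queue count change needs an nvme module reload or a reboot to take effect):

  echo "options nvme poll_queues=4" > /etc/modprobe.d/nvme-poll.conf
  echo 1 > /sys/block/nvme0n1/queue/io_poll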


A sample of results:

Single disk nvme poll_queues=0
pgbench -j4 -c X    tps    avg latency ms  latency stddev ms
1                   477         2.096             0.258
4                   1167        3.426             0.295
16                  4408        3.623             0.545
64                  12533       5.089             0.999
128                 21295       5.979             1.538
256                 28022       9.048             3.014
512                 28870      17.524             6.904


Single disk nvme poll_queues=4
pgbench -j4 -c X    tps    avg latency ms  latency stddev ms
128                 22284       5.711           1.448
256                 27390       9.240           2.848
512                 30596       16.452          6.090
1024                26352       38.481          19.513


4 disk RAID 0 4KB Chunk nvme poll_queues=0
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 17614       7.231           2.43
256                 22347       11.37           4.922
512                 22742       22.269          10.825
1024                20896       48.57           26.975


4 disk LVM RAID 0 4KB Chunk nvme poll_queues=4
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 17423       7.312           1.991
256                 22064      11.521           4.044
512                 24875      20.373           9.421
1024                21242      47.692           25.843


4 disk RAID 0 8KB Chunk nvme poll_queues=4
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 18907       6.736           2.094
256                 24909       10.184          3.446
512                 24878       20.331          8.448
1024                20535       49.665          27.462


4 disk RAID 0 64KB Chunk nvme poll_queues=4
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 21393       5.951           2.102
256                 27231       9.293           2.752
512                 30261      16.624           5.986
1024                25245      40.285          21.317


4 disk RAID 0 256KB Chunk nvme poll_queues=4
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 21055       6.046           1.972
256                 27226       9.289           2.736
512                 31174      16.120           5.858
1024                27069      37.355           17.797


4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=0 stripe_cache_size=256
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 15368       8.291           3.527
256                 19110       13.312          6.151
512                 19796       25.667          13.191


4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=4 stripe_cache_size=256
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 17143       7.431           2.353
256                 21777      11.665           4.175
512                 22602      22.433           9.239
1024                20291      50.189          26.191


4 disk RAID 5 4KB Chunk nvme poll_queues=4 group_thread_cnt=8 stripe_cache_size=256
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 16767       7.598           2.698
256                 21646      11.733           4.288
512                 22161      22.879           9.811

4 disk RAID 5 256KB Chunk nvme poll_queues=4 group_thread_cnt=8 stripe_cache_size=4096
pgbench -j4 -c X    tps     avg latency ms  latency stddev ms
128                 15997       7.966          2.754
256
512                 21745      23.293          9.805
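(For reference, the RAID5 knobs above are md sysfs attributes and were varied with something along
these lines; the md device name is illustrative:)

  echo 4096 > /sys/block/md127/md/stripe_cache_size
  echo 8    > /sys/block/md127/md/group_thread_cnt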

The obvious conclusion is that there is something broken with Linux MD RAID, so I examined
whether RAID0 scales with the simplest workloads, such as sequential reads.
Unfortunately, I did not reach a definite conclusion.
I posted my tests to the Linux RAID list but received no response
(https://lore.kernel.org/linux-raid/1af8f1e0-4f41-4f25-bc34-f655a4c141b4@pkm-inc.com/T/#u).
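The kind of raw sequential-read scaling test I mean can be run with fio along these lines
(parameters are illustrative, not the exact jobs from that post):

  fio --name=seqread --filename=/dev/md127 --rw=read --bs=1M \
      --ioengine=libaio --iodepth=32 --direct=1 --numjobs=4 \
      --runtime=60 --time_based --group_reporting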

So my question is this: has anyone else tested whether MD RAID scales with PostgreSQL, and what results did you get?
What is your typical setup when you have a bunch of NVMe drives in a shiny new server?


System specs:


Dell PowerEdge R7525, dual AMD EPYC 7313, 32 GB DDR4-3200
Disks used for tests: "Samsung SSD 980 PRO with Heatsink 2TB" drives
All drives under test are connected to the same processor.
AlmaLinux release 9.4 (Seafoam Ocelot)
Kernel 5.14.0-427.20.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC
tuned-adm profile postgresql
tweaks: nvme poll_queues=4, io_poll=1

version
--------------------------------------------------------------------------------------------------------------
  PostgreSQL 16.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), 64-bit
(1 row)

              name             |  current_setting   | source
------------------------------+--------------------+--------------------
  application_name             | psql               | client
  checkpoint_completion_target | 0.9                | configuration file
  client_encoding              | UTF8               | client
  DateStyle                    | ISO, MDY           | configuration file
  default_statistics_target    | 100                | configuration file
  default_text_search_config   | pg_catalog.english | configuration file
  dynamic_shared_memory_type   | posix              | configuration file
  effective_cache_size         | 24GB               | configuration file
  effective_io_concurrency     | 200                | configuration file
  lc_messages                  | C.UTF-8            | configuration file
  lc_monetary                  | C.UTF-8            | configuration file
  lc_numeric                   | C.UTF-8            | configuration file
  lc_time                      | C.UTF-8            | configuration file
  log_filename                 | postgresql-%a.log  | configuration file
  log_rotation_age             | 1d                 | configuration file
  log_rotation_size            | 0                  | configuration file
  log_timezone                 | Europe/Belgrade    | configuration file
  log_truncate_on_rotation     | on                 | configuration file
  logging_collector            | on                 | configuration file
  maintenance_work_mem         | 2GB                | configuration file
  max_connections              | 1024               | configuration file
  max_wal_size                 | 8GB                | configuration file
  min_wal_size                 | 2GB                | configuration file
  random_page_cost             | 1.1                | configuration file
  shared_buffers               | 8GB                | configuration file
  TimeZone                     | Europe/Belgrade    | configuration file
  wal_buffers                  | 16MB               | configuration file
  work_mem                     | 4MB                | configuration file






Thanks
Dragan



