Is NVMe RAID useless (performance-wise) with PostgreSQL?
From: Dragan Milivojević
Subject: Is NVMe RAID useless (performance-wise) with PostgreSQL?
Date:
Msg-id: 514b55e0-342c-4ec9-b6b4-544af0415dd6@pkm-inc.com
List: pgsql-general
Hi all,

While building a new PostgreSQL server, I realized that performance with a single disk is the same as or better than with a 4-disk RAID0 array. All benchmarks were conducted using pgbench with a scaling factor of 2000. For a typical run with pgbench -j 4 -c 512 -P 60 -r -T 300 -b tpcb-like, these are the results:

Single disk run:

latency average = 17.524 ms
latency stddev = 6.904 ms
tps = 28870

iostat:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          26.10   0.00    21.02     2.95    0.00  49.93

| Device | r/s | rMB/s | rrqm/s | %rrqm | r_await | rareq-sz | w/s | wMB/s | wrqm/s | %wrqm | w_await | wareq-sz | f/s | f_await | aqu-sz | %util |
|---------|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:--------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:|
| nvme0n1 | 28641.27 | 255.00 | 0.00 | 0.00 | 0.16 | 9.12 | 27665.67 | 458.09 | 0.00 | 0.00 | 0.09 | 16.96 | 251.47 | 1.69 | 7.69 | 98.08 |

RAID0, 4 disks, 4K chunk:

latency average = 22.269 ms
latency stddev = 10.825 ms
tps = 22742

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          23.63   0.00    19.63     1.53    0.00  55.21

| Device | r/s | rMB/s | rrqm/s | %rrqm | r_await | rareq-sz | w/s | wMB/s | wrqm/s | %wrqm | w_await | wareq-sz | f/s | f_await | aqu-sz | %util |
|---------|:--------:|:------:|:-------:|:-----:|:-------:|:--------:|:---------:|:------:|:------:|:-----:|:-------:|:--------:|:------:|:-------:|:------:|:-----:|
| md127 | 55359.93 | 216.25 | 0.00 | 0.00 | 0.09 | 4.00 | 105629.07 | 412.61 | 0.00 | 0.00 | 0.04 | 4.00 | 0.00 | 0.00 | 9.02 | 93.76 |
| nvme1n1 | 12763.33 | 54.03 | 1067.47 | 7.72 | 0.08 | 4.33 | 26572.07 | 103.31 | 37.33 | 0.14 | 0.05 | 3.98 | 162.53 | 1.74 | 2.67 | 99.18 |
| nvme3n1 | 12753.07 | 53.97 | 1063.87 | 7.70 | 0.08 | 4.33 | 26560.47 | 103.26 | 37.40 | 0.14 | 0.05 | 3.98 | 162.47 | 1.73 | 2.58 | 99.15 |
| nvme4n1 | 12787.27 | 54.10 | 1062.80 | 7.67 | 0.09 | 4.33 | 26492.73 | 102.99 | 35.67 | 0.13 | 0.05 | 3.98 | 162.53 | 1.69 | 2.67 | 99.07 |
| nvme5n1 | 12796.53 | 54.15 | 1065.60 | 7.69 | 0.09 | 4.33 | 26505.67 | 103.04 | 35.73 | 0.13 | 0.05 | 3.98 | 162.53 | 1.66 | 2.56 | 98.95 |

BTW, if these tables are mangled in transport or by email clients, I posted this email to https://pastebin.com/raw/ZmsH0T5M.

A 4K chunk is obviously not optimal, but I should still be getting around a 2x uplift. In the past, when tuning PostgreSQL, I tweaked various RAID parameters like chunk size, stripe_cache_size, etc., but it never occurred to me to check the performance against a single drive. Hence this email. I'm not sure if this is expected or if there is something wrong with my setup. Full system details are at the end of the message.

While exploring this, I went deep down the rabbit hole, running hundreds of tests and trying dozens of configurations. The best I achieved was 31K TPS with a plain RAID0 256KB chunk, nvme poll_queues=4 and io_poll=1. That is a measly 2% improvement over the best single-disk result.
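For completeness, the setup for that best-performing case looked roughly like this (a sketch rather than a verbatim log; the device names match the iostat output above and "bench" is a placeholder database name):

    # 4-disk RAID0 with a 256KB chunk
    mdadm --create /dev/md127 --level=0 --raid-devices=4 --chunk=256K \
          /dev/nvme1n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1

    # poll_queues is an nvme module parameter, so it has to be set
    # before the module loads
    echo "options nvme poll_queues=4" > /etc/modprobe.d/nvme-poll.conf

    # per-device polling attribute; on recent kernels this mostly just
    # reflects whether poll queues are configured
    echo 1 > /sys/block/nvme1n1/queue/io_poll

    # pgbench database initialized at scale factor 2000
    pgbench -i -s 2000 bench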
A sample of results. All runs below used pgbench -j 4; the first column is the client count (-c):

Single disk, nvme poll_queues=0:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 1 | 477 | 2.096 | 0.258 |
| 4 | 1167 | 3.426 | 0.295 |
| 16 | 4408 | 3.623 | 0.545 |
| 64 | 12533 | 5.089 | 0.999 |
| 128 | 21295 | 5.979 | 1.538 |
| 256 | 28022 | 9.048 | 3.014 |
| 512 | 28870 | 17.524 | 6.904 |

Single disk, nvme poll_queues=4:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 22284 | 5.711 | 1.448 |
| 256 | 27390 | 9.240 | 2.848 |
| 512 | 30596 | 16.452 | 6.090 |
| 1024 | 26352 | 38.481 | 19.513 |

4-disk RAID0, 4KB chunk, nvme poll_queues=0:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 17614 | 7.231 | 2.43 |
| 256 | 22347 | 11.37 | 4.922 |
| 512 | 22742 | 22.269 | 10.825 |
| 1024 | 20896 | 48.57 | 26.975 |

4-disk LVM RAID0, 4KB chunk, nvme poll_queues=4:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 17423 | 7.312 | 1.991 |
| 256 | 22064 | 11.521 | 4.044 |
| 512 | 24875 | 20.373 | 9.421 |
| 1024 | 21242 | 47.692 | 25.843 |

4-disk RAID0, 8KB chunk, nvme poll_queues=4:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 18907 | 6.736 | 2.094 |
| 256 | 24909 | 10.184 | 3.446 |
| 512 | 24878 | 20.331 | 8.448 |
| 1024 | 20535 | 49.665 | 27.462 |

4-disk RAID0, 64KB chunk, nvme poll_queues=4:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 21393 | 5.951 | 2.102 |
| 256 | 27231 | 9.293 | 2.752 |
| 512 | 30261 | 16.624 | 5.986 |
| 1024 | 25245 | 40.285 | 21.317 |

4-disk RAID0, 256KB chunk, nvme poll_queues=4:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 21055 | 6.046 | 1.972 |
| 256 | 27226 | 9.289 | 2.736 |
| 512 | 31174 | 16.120 | 5.858 |
| 1024 | 27069 | 37.355 | 17.797 |

4-disk RAID5, 4KB chunk, nvme poll_queues=4, group_thread_cnt=0, stripe_cache_size=256:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 15368 | 8.291 | 3.527 |
| 256 | 19110 | 13.312 | 6.151 |
| 512 | 19796 | 25.667 | 13.191 |

4-disk RAID5, 4KB chunk, nvme poll_queues=4, group_thread_cnt=4, stripe_cache_size=256:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 17143 | 7.431 | 2.353 |
| 256 | 21777 | 11.665 | 4.175 |
| 512 | 22602 | 22.433 | 9.239 |
| 1024 | 20291 | 50.189 | 26.191 |

4-disk RAID5, 4KB chunk, nvme poll_queues=4, group_thread_cnt=8, stripe_cache_size=256:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 16767 | 7.598 | 2.698 |
| 256 | 21646 | 11.733 | 4.288 |
| 512 | 22161 | 22.879 | 9.811 |

4-disk RAID5, 256KB chunk, nvme poll_queues=4, group_thread_cnt=8, stripe_cache_size=4096:

| -c | tps | avg latency (ms) | latency stddev (ms) |
|:----:|:-----:|:------:|:------:|
| 128 | 15997 | 7.966 | 2.754 |
| 256 | | | |
| 512 | 21745 | 23.293 | 9.805 |

The obvious conclusion is that there is something broken with Linux MD RAID, so I examined whether RAID0 scales even with the simplest workloads, such as sequential reads. Unfortunately, I did not reach a definite conclusion. I posted my tests to the Linux RAID list but received no response (https://lore.kernel.org/linux-raid/1af8f1e0-4f41-4f25-bc34-f655a4c141b4@pkm-inc.com/T/#u).
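A PostgreSQL-free sequential-read check against the md device can be done with fio; a minimal sketch of that kind of run (block size, queue depth and job count are illustrative, not the exact parameters from the linux-raid post):

    # sequential reads straight from the md device, bypassing the page cache
    fio --name=seqread --filename=/dev/md127 --rw=read --bs=1M \
        --ioengine=io_uring --direct=1 --iodepth=32 --numjobs=4 \
        --runtime=60 --time_based --group_reporting

The RAID5 knobs varied in the tables above are md sysfs attributes (RAID5/6 only):

    # values as used in the 256KB-chunk RAID5 runs
    echo 4096 > /sys/block/md127/md/stripe_cache_size
    echo 8 > /sys/block/md127/md/group_thread_cnt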
So my question is this: has anyone else tested whether MD RAID scales with PostgreSQL, and what results did you get? What is your typical setup when you have a bunch of NVMe drives in a shiny new server?

System specs:

Dell PowerEdge R7525, dual AMD EPYC 7313, 32GB DDR4-3200
Disks used for tests: "Samsung SSD 980 PRO with Heatsink 2TB" drives
All drives under test are connected to the same processor.

AlmaLinux release 9.4 (Seafoam Ocelot)
Kernel 5.14.0-427.20.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC
tuned-adm profile: postgresql
Tweaks: nvme poll_queues=4, io_poll=1

PostgreSQL 16.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3), 64-bit

| name | current_setting | source |
|------|-----------------|--------|
| application_name | psql | client |
| checkpoint_completion_target | 0.9 | configuration file |
| client_encoding | UTF8 | client |
| DateStyle | ISO, MDY | configuration file |
| default_statistics_target | 100 | configuration file |
| default_text_search_config | pg_catalog.english | configuration file |
| dynamic_shared_memory_type | posix | configuration file |
| effective_cache_size | 24GB | configuration file |
| effective_io_concurrency | 200 | configuration file |
| lc_messages | C.UTF-8 | configuration file |
| lc_monetary | C.UTF-8 | configuration file |
| lc_numeric | C.UTF-8 | configuration file |
| lc_time | C.UTF-8 | configuration file |
| log_filename | postgresql-%a.log | configuration file |
| log_rotation_age | 1d | configuration file |
| log_rotation_size | 0 | configuration file |
| log_timezone | Europe/Belgrade | configuration file |
| log_truncate_on_rotation | on | configuration file |
| logging_collector | on | configuration file |
| maintenance_work_mem | 2GB | configuration file |
| max_connections | 1024 | configuration file |
| max_wal_size | 8GB | configuration file |
| min_wal_size | 2GB | configuration file |
| random_page_cost | 1.1 | configuration file |
| shared_buffers | 8GB | configuration file |
| TimeZone | Europe/Belgrade | configuration file |
| wal_buffers | 16MB | configuration file |
| work_mem | 4MB | configuration file |

Thanks,
Dragan