Thread: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Hi,

Please pardon me if this question is already answered in the documentation, the wiki, or the mailing list archive. The problem is that I don't know the exact term to search for - I've tried searching for "linear scalability" and "concurrency vs performance" but didn't find what I was looking for.

## MAIN QUESTION

pgbench -c 1 achieves approx 80 TPS
pgbench -c 6 should achieve approx 480 TPS, but only achieves 360 TPS
pgbench -c 12 should achieve approx 960 TPS, but only achieves 610 TPS

If pgbench is being run on a 4c/8t machine and pg-server is being run on a 6c/12t machine with 32GB RAM [1], and the two servers are connected with a 1 Gbit/s connection, I don't think either pgbench or pg-server is being constrained by hardware, right?

In that case, why is it not possible to achieve linear scalability, at least up to 12 concurrent connections (i.e. the thread count of pg-server)? What is an easy way to identify the limiting factor? Is it network connectivity? Disk IOPS? CPU load? Some config parameter?

## SECONDARY QUESTION

At what level of concurrent connections should settings like shared_buffers, effective_cache_size, and max_wal_size start making a difference? With my hardware [1], I'm seeing a difference only after 48 concurrent connections, and even then it's only a 15-30% improvement over the default settings that ship with the Ubuntu 18.04 package. Is this expected? Isn't this allocating too many resources for too little gain?

## CONTEXT

I am currently trying to benchmark PG 11 (via pgbench) to figure out the configuration parameters that deliver optimum performance for my hardware [1] and workload [2].

Based on https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server I've made the following relevant changes to the default PG config on Ubuntu 18.04:

max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=12288MB
temp_buffers=8MB
effective_cache_size=16GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=5GB
autovacuum=off # NOTE: Only for benchmarking

[1] 32 GB RAM - 6 core/12 thread - 2x SSD in RAID1
[2] SaaS webapp - a mixed workload which looks a lot like TPC-B

Thanks,
Saurabh.
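For reference, a concurrency sweep like the one above can be scripted roughly as follows (a sketch; the host name pg-server, the database name bench, the scale factor, and the run duration are placeholders, not values from this thread):

# one-time initialisation of the pgbench tables at a placeholder scale factor
pgbench -h pg-server -U postgres -i -s 100 bench

# run each concurrency level for a fixed duration, printing progress every 10s;
# -j gives pgbench roughly one worker thread per 2 clients (clients must be a
# multiple of threads, which holds for every value in this list)
for c in 1 6 12 24 48 96 192 384; do
    pgbench -h pg-server -U postgres -c "$c" -j "$(( (c + 1) / 2 ))" -T 300 -P 10 bench
done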
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
wal_sync_method=fsync
Why this change?
PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 ....
You could also try pg_test_fsync to get low-level information, to supplement the high-level view you get from pgbench.
The effects of max_wal_size are going to depend on how you have IO configured, for example does pg_wal share the same devices and controllers as the base data? It is mostly about controlling disk usage and crash-recovery performance, neither of which is of primary importance to pgbench performance.
Not all SSD are created equal, so the details here matter, both for the underlying drives and the raid controller.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 ....I am currently running all my benchmarks with synchronous_commit=off and will get back with my findings.
+--------+---------------------+------------------------+
| client | Mostly defaults [1] | Optimised settings [2] |
+--------+---------------------+------------------------+
| 1 | 80-86 | 169-180 |
+--------+---------------------+------------------------+
| 6 | 350-376 | 1265-1397 |
+--------+---------------------+------------------------+
| 12 | 603-619 | 1746-2352 |
+--------+---------------------+------------------------+
| 24 | 947-1015 | 1869-2518 |
+--------+---------------------+------------------------+
| 48 | 1435-1512 | 1912-2818 |
+--------+---------------------+------------------------+
| 96 | 1769-1811 | 1546-1753 |
+--------+---------------------+------------------------+
| 192 | 1857-1992 | 1332-1508 |
+--------+---------------------+------------------------+
| 384 | 1667-1793 | 1356-1450 |
+--------+---------------------+------------------------+
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
On Sun, Jan 27, 2019 at 01:09:16PM +0530, Saurabh Nanda wrote:
> It seems that PGOPTIONS="-c synchronous_commit=off" has a significant
> impact. However, I still can not understand why the TPS for the optimised
> case is LOWER than the default for higher concurrency levels!

Do you know which of the settings is causing lower TPS? I suggest checking shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues, especially with large RAM like shared_buffers, at least with older kernels.

https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

Justin
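For reference, disabling THP and KSM until the next reboot looks roughly like this (a sketch; the sysfs paths are standard on modern Linux kernels, but making the change survive a reboot is distro-specific):

# stop transparent huge pages from being allocated or defragmented
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# stop kernel samepage merging
echo 0 | sudo tee /sys/kernel/mm/ksm/run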
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Do you know which of the settings is causing lower TPS? I suggest checking shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues, especially with large RAM like shared_buffers, at least with older kernels.

https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
> You could also try pg_test_fsync to get low-level information, to supplement the high-level view you get from pgbench.

Thanks for pointing me to this tool - I never knew pg_test_fsync existed! I've run `pg_test_fsync -s 60` twice and this is the output: https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm not sure what to make of it.
> The effects of max_wal_size are going to depend on how you have IO configured, for example does pg_wal share the same devices and controllers as the base data? It is mostly about controlling disk usage and crash-recovery performance, neither of which is of primary importance to pgbench performance.

The WAL and the data directory reside on the same SSD - is this a bad idea? I was under the impression that smaller values for max_wal_size cause pg-server to do "maintenance work" related to WAL rotation, etc. more frequently, which would lead to lower pgbench performance.
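One way to check how often that WAL-related "maintenance work" (checkpoints) actually happens during a run is to enable checkpoint logging and watch the server log while pgbench is running. A sketch, assuming the stock Ubuntu PG 11 log location:

psql -c "ALTER SYSTEM SET log_checkpoints = on;"
psql -c "SELECT pg_reload_conf();"

# during a pgbench run, look for "checkpoint starting" / "checkpoint complete"
tail -f /var/log/postgresql/postgresql-11-main.log | grep checkpoint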
PS: Cc-ing the list back again because I assume you didn't intend for your reply to be private, right?
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 ....I am currently running all my benchmarks with synchronous_commit=off and will get back with my findings.It seems that PGOPTIONS="-c synchronous_commit=off" has a significant impact.
> However, I still can not understand why the TPS for the optimised case is LOWER than the default for higher concurrency levels!
> +--------+---------------------+------------------------+
> | client | Mostly defaults [1] | Optimised settings [2] |
> +--------+---------------------+------------------------+
> | 1 | 80-86 | 169-180 |
> +--------+---------------------+------------------------+
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
It is usually not acceptable to run applications with synchronous_commit=off, so once you have identified that the bottleneck is in implementing synchronous_commit=on, you probably need to take a deep dive into your hardware to figure out why it isn't performing the way you need/want/expect it to. Tuning the server under synchronous_commit=off when you don't intend to run your production server with that setting is unlikely to be fruitful.
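As an aside, synchronous_commit does not have to be all-or-nothing: it can be relaxed per transaction, so an application can keep durable commits by default and turn them off only for non-critical writes. A sketch (the database name app_db and table audit_log are placeholders, not from this thread):

psql -d app_db -c "
BEGIN;
SET LOCAL synchronous_commit = off;  -- only this transaction may lose its commit on a crash; no corruption risk
INSERT INTO audit_log (message) VALUES ('non-critical event');
COMMIT;"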
+--------+-----------------------------------------------------------------+------------------------+
| | synchronous_commit=on | synchronous_commit=off |
+--------+-----------------------------------------------------------------+------------------------+
| client | Mostly defaults [1] | Optimised [2] | Optimised [2] | Optimised [2] |
| | | + shared_buffers=2G | + shared_buffers=8G | + shared_buffers=8G |
+--------+---------------------+---------------------+---------------------+------------------------+
| 1 | 80-86 | 74-77 | 75-75 | 169-180 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 6 | 350-376 | 301-304 | 295-300 | 1265-1397 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 12 | 603-619 | 476-488 | 485-493 | 1746-2352 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 24 | 947-1015 | 678-739 | 723-770 | 1869-2518 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 48 | 1435-1512 | 950-1043 | 1029-1086 | 1912-2818 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 96 | 1769-1811 | 3337-3459 | 1302-1346 | 1546-1753 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 192 | 1857-1992 | 3613-3715 | 1269-1345 | 1332-1508 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 384 | 1667-1793 | 3180-3300 | 1262-1364 | 1356-1450 |
+--------+---------------------+---------------------+---------------------+------------------------+
In case you do intend to run with synchronous_commit=off, or if you are just curious: running with a very high number of active connections often reveals subtle bottlenecks and interactions, and is very dependent on your hardware. Unless you actually intend to run your server with synchronous_commit=off and with a large number of active connections, it is probably not worth investigating this.
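One generic way to see what a large number of active connections is actually waiting on is to sample wait events from pg_stat_activity during the run (a sketch; the wait_event columns are available since PostgreSQL 9.6):

psql -c "SELECT wait_event_type, wait_event, count(*)
         FROM pg_stat_activity
         WHERE state <> 'idle'
         GROUP BY 1, 2
         ORDER BY 3 DESC;"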
I'm more interested in the low end; you should do much better than those reported numbers when clients=1 and synchronous_commit=off with the data on SSD. I think you said that pgbench is running on a different machine than the database, so perhaps it is just network overhead that is keeping this value down. What happens if you run them on the same machine?
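Running both on the same machine over a Unix socket takes the network out of the picture entirely. A sketch, assuming the Debian/Ubuntu default socket directory and a placeholder database name:

pgbench -h /var/run/postgresql -c 1 -j 1 -T 300 -P 10 bench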
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
+--------+------------------------------------------------+-----------------+
| | synchronous_commit=on | sync_commit=off |
+--------+------------------------------------------------+-----------------+
| client | Defaults [1] | Optimised [2] | Optimised [2] | Optimised [2] |
| | | (buffers=2G) | (buffers=8G) | (buffers=8G) |
+--------+--------------+----------------+----------------+-----------------+
| 1 | 80-86 | 74-77 | 75-75 | 169-180 |
+--------+--------------+----------------+----------------+-----------------+
| 6 | 350-376 | 301-304 | 295-300 | 1265-1397 |
+--------+--------------+----------------+----------------+-----------------+
| 12 | 603-619 | 476-488 | 485-493 | 1746-2352 |
+--------+--------------+----------------+----------------+-----------------+
| 24 | 947-1015 | 678-739 | 723-770 | 1869-2518 |
+--------+--------------+----------------+----------------+-----------------+
| 48 | 1435-1512 | 950-1043 | 1029-1086 | 1912-2818 |
+--------+--------------+----------------+----------------+-----------------+
| 96 | 1769-1811 | 3337-3459 | 1302-1346 | 1546-1753 |
+--------+--------------+----------------+----------------+-----------------+
| 192 | 1857-1992 | 3613-3715 | 1269-1345 | 1332-1508 |
+--------+--------------+----------------+----------------+-----------------+
| 384 | 1667-1793 | 3180-3300 | 1262-1364 | 1356-1450 |
+--------+--------------+----------------+----------------+-----------------+
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
On 28/01/2019 at 15:03, Saurabh Nanda wrote:
> An update. It seems (to my untrained eye) that something is wrong with
> the second SSD in the RAID configuration. Here's my question on
> serverfault related to what I saw with iostat -
> https://serverfault.com/questions/951096/difference-in-utilisation-reported-by-iostat-for-two-identical-disks-in-raid1
>
> I've disabled RAID and rebooted the server to run the benchmarks with
> client=1,4,8,12 with shared_buffers=8MB (default) vs
> shared_buffers=2GB (optimised?) and will report back.

You should probably include the detailed hardware you are working on - especially for the SSD, the model can have a big impact, as well as its wear.

Nicolas
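For reference, the per-device utilisation figures discussed above can be sampled with iostat (part of the sysstat package); a sketch:

# extended device statistics every 5 seconds; for a pgbench write workload,
# %util, await and w/s are the columns worth comparing across the two disks
iostat -dx 5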
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
You should probably include the detailed hardware you are working on -
especially for the SSD, the model can have a big impact, as well as its
wear.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
You should probably include the detailed hardware you are working on -
especially for the SSD, the model can have a big impact, as well as its
wear.

What's the best tool to get meaningful information for SSD drives?
smartctl is a good start
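For example (a sketch; smartctl ships with the smartmontools package, and /dev/sda is a placeholder for the actual drive):

# print identity, SMART health, wear indicators and error counters
sudo smartctl -a /dev/sda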
Nicolas
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
smartctl is a good start
+--------+--------------+----------------+
| client | Defaults [1] | buffers=2G [2] |
+--------+--------------+----------------+
| 1 | 348-475 (??) | 529-583 (??) |
+--------+--------------+----------------+
| 4 | 436-452 | 451-452 |
+--------+--------------+----------------+
| 8 | 862-869 | 859-861 |
+--------+--------------+----------------+
| 12 | 1210-1219 | 1220-1225 |
+--------+--------------+----------------+
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
Do you know which of the settings is causing lower TPS? I suggest checking shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues, especially with large RAM like shared_buffers, at least with older kernels.

https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com
+--------+--------------+----------------+---------------+
| client | Defaults [1] | buffers=2G [2] | buffers=2G |
| | | | huge_pages=on |
+--------+--------------+----------------+---------------+
| 1 | 348-475 (??) | 529-583 (??) | 155-290 |
+--------+--------------+----------------+---------------+
| 4 | 436-452 | 451-452 | 388-403 |
+--------+--------------+----------------+---------------+
| 8 | 862-869 | 859-861 | 778-781 |
+--------+--------------+----------------+---------------+
| 12 | 1210-1219 | 1220-1225 | 1110-1111 |
+--------+--------------+----------------+---------------+
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
+--------+--------------+------------------+
| client | Hetzner | E2E Networks |
| | EX41-SSD [1] | Cloud Server [2] |
+--------+--------------+------------------+
| 1 | ~160 | ~400 |
+--------+--------------+------------------+
| 4 | ~460 | ~1450 |
+--------+--------------+------------------+
| 8 | ~850 | ~2600 |
+--------+--------------+------------------+
| 12 | ~1200 | ~4000 |
+--------+--------------+------------------+
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water.
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
> c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water.

I noticed that on E2E, the root filesystem is mounted with the following options:

/dev/xvda on / type ext4 (rw,noatime,nodiratime,nobarrier,errors=remount-ro,stripe=512,data=ordered)

whereas on Hetzner, it is mounted with the following options:

/dev/nvme0n1p3 on / type ext4 (rw,relatime,data=ordered)

How much of a difference can this have on absolute TPS numbers?
Differences can be significant. noatime does not update the inode access time at all, while relatime updates the access time only when it is older than the modify/change time (which can often be the case for a database).

nobarrier disables block-level write barriers. Barriers ensure that data has actually reached stable storage; the mount man page says: "If disabled on a device with a volatile (non-battery-backed) write-back cache, the nobarrier option will lead to filesystem corruption on a system crash or power loss."

You should probably consider noatime rather than relatime, and whether nobarrier is safe depends on whether you have a battery-backed cache or not.

Also, since this is an SSD, you should TRIM it, either with a periodic fstrim or by using the discard mount option.
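For what it's worth, the remount and TRIM operations mentioned above look roughly like this on a typical Linux system (a sketch; make the noatime change permanent by editing the options field in /etc/fstab):

# remount the root filesystem with noatime instead of relatime
sudo mount -o remount,noatime /

# one-off TRIM of all mounted filesystems that support it
sudo fstrim -av

# or let systemd run TRIM weekly
sudo systemctl enable --now fstrim.timer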
Nicolas
Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?
All this benchmarking has led me to a philosophical question: why does PG need shared_buffers in the first place?
What's wrong with letting the OS do the caching/buffering?
Isn't it optimised for this kind of stuff?