Thread: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

Hi,

Please pardon me if this question is already answered in the documentation, Wiki, or the mailing list archive. The problem is that I don't know the exact term to search for - I've tried searching for "linear scalability" and "concurrency vs performance" but didn't find what I was looking for.

## MAIN QUESTION

pgbench -c 1 achieves approx 80 TPS
pgbench -c 6 should achieve approx 480 TPS, but only achieves 360 TPS
pgbench -c 12 should achieve approx 960 TPS, but only achieves 610 TPS

If pgbench is being run on a 4c/8t machine and pg-server is being run on a 6c/12t machine with 32GB RAM [1], and the two servers are connected via a 1 Gbit/s connection, I don't think either pgbench or pg-server is being constrained by hardware, right?

In that case, why is it not possible to achieve linear scalability, at least up to 12 concurrent connections (i.e. the thread count of pg-server)? What is an easy way to identify the limiting factor? Is it network connectivity? Disk IOPS? CPU load? Some config parameter?
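
(For what it's worth, the only way I can think of to spot the limiting factor is to watch both boxes while a run is in progress. A rough sketch, assuming the standard sysstat tools are installed; the intervals are arbitrary:)

    # on the pg-server box (and, for comparison, the pgbench box) during a run
    vmstat 5          # CPU saturation, run queue, context switches, iowait
    iostat -x 5       # per-device utilisation, await, r/s, w/s
    sar -n DEV 5      # per-interface network throughput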

## SECONDARY QUESTION

At what level of concurrent connections should settings like shared_buffers, effective_cache_size, max_wal_size start making a difference? With my hardware [1], I'm seeing a difference only after 48 concurrent connections, and even then it's only a 15-30% improvement over the default settings that ship with the Ubuntu 18.04 package. Is this expected? Isn't this allocating too many resources for too little gain?

## CONTEXT

I am currently trying to benchmark PG 11 (via pgbench) to figure out the configuration parameters that deliver optimum performance for my hardware [1] and workload [2].
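
(For reference, the runs look roughly like the following; the host name, database name, scale factor and durations shown here are placeholders rather than my exact values:)

    # create and initialise the pgbench database once
    createdb -h pg-server -U postgres bench
    pgbench -i -s 100 -h pg-server -U postgres bench
    # a typical run: 12 clients, 4 worker threads, 5 minutes, progress every 10s
    pgbench -c 12 -j 4 -T 300 -P 10 -h pg-server -U postgres bench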

Based on https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server I've made the following relevant changes to the default PG config on Ubuntu 18.04:

    max_connections=400
    work_mem=4MB
    maintenance_work_mem=64MB
    shared_buffers=12288MB
    temp_buffers=8MB
    effective_cache_size=16GB
    wal_buffers=-1
    wal_sync_method=fsync
    max_wal_size=5GB
    autovacuum=off  # NOTE: Only for benchmarking

[1] 32 GB RAM - 6 core/12 thread - 2x SSD in RAID1
[2] SaaS webapp -- it's a mixed workload which looks a lot like TPC-B

Thanks,
Saurabh.
Is there any material on how to benchmark Postgres meaningfully? I'm getting very frustrated with the numbers that `pgbench` is reporting:

-- allocating more resources to Postgres seems to be randomly dropping performance
-- there seems to be no repeatability in the benchmarking numbers [1]
-- there is no way to figure out what is causing a bottleneck and which knob/setting is going to alleviate it.

How do the PG wizards figure all this out?


-- Saurabh.

Hi Jeff,

Thank you for replying.

 
    wal_sync_method=fsync

Why this change?  

Actually, I re-checked and noticed that this config section was left at its default value, which is the following. Since the commented line said `wal_sync_method = fsync`, I _assumed_ that's the default value. But it seems that Linux uses fdatasync by default, and the output of pg_test_fsync also shows that it is /probably/ the fastest method on my hardware.

    # wal_sync_method = fsync                
    # the default is the first option supported by the operating system:
    #   open_datasync
    #   fdatasync (default on Linux)
    #   fsync
    #   fsync_writethrough
    #   open_sync


PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 ....

I am currently running all my benchmarks with synchronous_commit=off and will get back with my findings. 

You could also try pg_test_fsync to get low-level information, to supplement the high level you get from pgbench.

Thanks for pointing me to this tool -- I never knew pg_test_fsync existed! I've run `pg_test_fsync -s 60` two times and this is the output - https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm not sure what to make of it. Should I tweak the setting of `wal_sync_method` to something other than the default value?

The effects of max_wal_size are going to depend on how you have IO configured, for example does pg_wal share the same devices and controllers as the base data?  It is mostly about controlling disk usage and crash-recovery performance, neither of which is of primary importance to pgbench performance.

 The WAL and the data-directory reside on the same SSD disk -- is this a bad idea? I was under the impression that smaller values for max_wal_size cause pg-server to do "maintenance work" related to wal rotation, etc. more frequently and would lead to lower pgbench performance.

Not all SSD are created equal, so the details here matter, both for the underlying drives and the raid controller.

Here's the relevant output from lshw -- https://gist.github.com/saurabhnanda/d7107d4ab1bb48e94e0a5e3ef96e7260 It seems I have Micron SSDs. I tried finding more information on RAID but couldn't get anything in the lshw or lspci output except the following -- `SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)`. Moreover, the devices are showing up as /dev/md1, /dev/md2, etc. So, if my understanding is correct, I'm not on hardware RAID but on software RAID, right?
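
(For anyone checking the same thing, a sketch of commands that should show this -- standard Linux tools, nothing Postgres-specific:)

    cat /proc/mdstat                       # md devices => Linux software RAID (mdraid)
    lsblk -o NAME,TYPE,SIZE,MOUNTPOINT     # block device layout
    sudo mdadm --detail /dev/md2           # RAID level, state and member disks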

These machines are from the EX-line of dedicated servers provided by Hetzner, btw.

PS: Cc-ing the list back again because I assume you didn't intend for your reply to be private, right?

-- Saurabh.

PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 ....

I am currently running all my benchmarks with synchronous_commit=off and will get back with my findings. 


It seems that PGOPTIONS="-c synchronous_commit=off" has a significant impact. However, I still cannot understand why the TPS for the optimised case is LOWER than the default for higher concurrency levels!

+--------+---------------------+------------------------+
| client | Mostly defaults [1] | Optimised settings [2] |
+--------+---------------------+------------------------+
| 1      | 80-86               | 169-180                |
+--------+---------------------+------------------------+
| 6      | 350-376             | 1265-1397              |
+--------+---------------------+------------------------+
| 12     | 603-619             | 1746-2352              |
+--------+---------------------+------------------------+
| 24     | 947-1015            | 1869-2518              |
+--------+---------------------+------------------------+
| 48     | 1435-1512           | 1912-2818              |
+--------+---------------------+------------------------+
| 96     | 1769-1811           | 1546-1753              |
+--------+---------------------+------------------------+
| 192    | 1857-1992           | 1332-1508              |
+--------+---------------------+------------------------+
| 384    | 1667-1793           | 1356-1450              |
+--------+---------------------+------------------------+

[1] "Mostly default" settings are whatever ships with Ubuntu 18.04 + PG 11. A snippet of the relevant setts are given below:

    max_connection=400
    work_mem=4MB
    maintenance_work_mem=64MB
    shared_buffers=128MB
    temp_buffers=8MB
    effective_cache_size=4GB
    wal_buffers=-1
    wal_sync_method=fsync
    max_wal_size=1GB
    autovacuum=off            # Auto-vacuuming was disabled


[2] An optimised version of the settings was obtained from https://pgtune.leopard.in.ua/#/ and, along with that, the benchmarks were run with PGOPTIONS="-c synchronous_commit=off":

    max_connections = 400
    shared_buffers = 8GB
    effective_cache_size = 24GB
    maintenance_work_mem = 2GB
    checkpoint_completion_target = 0.7
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 3495kB
    min_wal_size = 1GB
    max_wal_size = 2GB
    max_worker_processes = 12
    max_parallel_workers_per_gather = 6
    max_parallel_workers = 12
    autovacuum=off            # Auto-vacuuming was disabled

On Sun, Jan 27, 2019 at 01:09:16PM +0530, Saurabh Nanda wrote:
> It seems that PGOPTIONS="-c synchronous_commit=off" has a significant
> impact. However, I still can not understand why the TPS for the optimised
> case is LOWER than the default for higher concurrency levels!

Do you know which of the settings is causing lower TPS ?

I suggest to check shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues,
esp. with large RAM like shared_buffers, at least with older kernels.
https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

Justin


 
Do you know which of the settings is causing lower TPS ? 

I suggest to check shared_buffers.

I'm trying to find this, but it's taking a lot of time since I'm re-running the benchmarks changing one config setting at a time. Thanks for the tip related to shared_buffers.
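
(Roughly, each sweep amounts to something like the loop below -- a sketch, assuming passwordless ssh/sudo to the pg-server box and the same placeholder names as before:)

    # one run per shared_buffers value, changing nothing else
    for sb in 128MB 2GB 4GB 8GB; do
        psql -h pg-server -U postgres -d postgres \
             -c "ALTER SYSTEM SET shared_buffers = '$sb';"
        ssh pg-server 'sudo systemctl restart postgresql'   # needs a restart
        sleep 30                                            # let things settle
        pgbench -c 12 -j 4 -T 300 -h pg-server -U postgres bench \
            | tee "pgbench_shared_buffers_${sb}.log"
    done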
 

If you haven't done it, disabling THP and KSM can resolve performance issues,
esp. with large RAM like shared_buffers, at least with older kernels.
https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

Is this a well-known performance "hack"? Is there any reason why it is not mentioned at https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server ? Are the stability implications of fiddling with THP and KSM well-known? Also, wrt KSM, my understanding was that when a process forks, its memory is copy-on-write anyway, right? What other kinds of pages would end up being de-duplicated by ksmd? (Caveat: This is the first time I'm hearing about KSM and my knowledge is based off a single reading of https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.html )
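
(For anyone following along, disabling them at runtime appears to come down to the following -- a sketch; these sysfs settings do not survive a reboot:)

    # disable transparent huge pages (THP)
    echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
    echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
    # stop KSM and un-merge any pages it has already shared
    echo 2 | sudo tee /sys/kernel/mm/ksm/run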

-- Saurabh.


You could also try pg_test_fsync to get low-level information, to supplement the high level you get from pgbench.

Thanks for pointing me to this tool -- I never knew pg_test_fsync existed! I've run `pg_test_fsync -s 60` two times and this is the output - https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm not sure what to make of it.

I don't know what to make of that either.  I'd expect fdatasync using two 8kB writes to be about the same throughput as using one 8kB write, but instead it is 4 times slower.  Also, I'd expect open_datasync to get slower by a factor of 2, not a factor of 8, when going from one to two 8kB writes (that is not directly relevant, as you aren't using open_datasync, but is curious nonetheless).  Is this reproducible with different run lengths?  I wonder if your write cache (or something) gets "tired" during the first part of pg_test_fsync and thus degrades the subsequent parts of the test.  I would say something in your IO stack is not optimal, maybe some component is "consumer grade" rather than "server grade".  Maybe you can ask Hetzner about that.
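
(One way to check that would be to vary the per-test duration with -s and, if desired, point the tool at the same filesystem as pg_wal -- the data-directory path below is just the Ubuntu default and may need adjusting:)

    pg_test_fsync -s 5
    pg_test_fsync -s 60
    sudo -u postgres pg_test_fsync -s 30 \
        -f /var/lib/postgresql/11/main/pg_wal/fsync_test.out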


The effects of max_wal_size are going to depend on how you have IO configured, for example does pg_wal share the same devices and controllers as the base data?  It is mostly about controlling disk usage and crash-recovery performance, neither of which is of primary importance to pgbench performance.

 The WAL and the data-directory reside on the same SSD disk -- is this a bad idea?

If you are trying to squeeze out every last bit of performance, then I think it is a bad idea.  Or at least, it is something where you should try the alternative and see.  The flushing that occurs during checkpoints and the flushing that occurs for every commit can interfere with each other.
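
(The usual way to try that, assuming a second device mounted at, say, /mnt/wal-disk, is to relocate pg_wal and symlink it back -- with the server stopped; the data-directory path is the Ubuntu default:)

    sudo systemctl stop postgresql
    sudo mv /var/lib/postgresql/11/main/pg_wal /mnt/wal-disk/pg_wal
    sudo ln -s /mnt/wal-disk/pg_wal /var/lib/postgresql/11/main/pg_wal
    sudo chown -h postgres:postgres /var/lib/postgresql/11/main/pg_wal
    sudo systemctl start postgresql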
 
I was under the impression that smaller values for max_wal_size cause pg-server to do "maintenance work" related to wal rotation, etc. more frequently and would lead to lower pgbench performance.

If you choose ridiculously small values it would.  But once the value is sufficient, increasing it further wouldn't do much.  Given your low level of throughput, I would think the default is already sufficient. 
 
Thanks for including the storage info.  Nothing about it stands out to me as either good or bad, but I'm not a hardware maven; hopefully one will be reading along and speak up.


PS: Cc-ing the list back again because I assume you didn't intend for your reply to be private, right?

Yes, I had intended to include the list but hit the wrong button, sorry.

Cheers,

Jeff
On Sun, Jan 27, 2019 at 2:39 AM Saurabh Nanda <saurabhnanda@gmail.com> wrote:

PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 ....

I am currently running all my benchmarks with synchronous_commit=off and will get back with my findings. 


It seems that PGOPTIONS="-c synchronous_commit=off" has a significant impact.

It is usually not acceptable to run applications with synchronous_commit=off, so once you have identified that the bottleneck is in implementing synchronous_commit=on, you probably need to take a deep dive into your hardware to figure out why it isn't performing the way you need/want/expect it to.  Tuning the server under synchronous_commit=off when you don't intend to run your production server with that setting is unlikely to be fruitful.


> However, I still can not understand why the TPS for the optimised case is LOWER than the default for higher concurrency levels!  

In case you do intend to run with synchronous_commit=off, or if you are just curious:  running with a very high number of active connections often reveals subtle bottlenecks and interactions, and is very dependent on your hardware.  Unless you actually intend to run your server with synchronous_commit=off and with a large number of active connections, it is probably not worth investigating this.  You can make a hobby of it, of course, but it is a time consuming hobby to have.  If you do want to, I think you should start out with your optimized settings and revert them one at a time to find the one that caused the performance regression.

I'm more interested in the low end, you should do much better than those reported numbers when clients=1 and synchronous_commit=off with the data on SSD.  I think you said that pgbench is running on a different machine than the database, so perhaps it is just network overhead that is keeping this value down.  What happens if you run them on the same machine?
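
(That comparison would look something like this -- over the 1 Gbit/s link, over local TCP, and over the local Unix socket, which skips TCP entirely; names are placeholders:)

    # from the pgbench box, over the network
    pgbench -c 1 -T 300 -h pg-server -U postgres bench
    # on the database box, via local TCP
    pgbench -c 1 -T 300 -h 127.0.0.1 -U postgres bench
    # on the database box, via the Unix socket (no -h)
    pgbench -c 1 -T 300 -U postgres bench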

> +--------+---------------------+------------------------+
> | client | Mostly defaults [1] | Optimised settings [2] |
> +--------+---------------------+------------------------+
> | 1      | 80-86               | 169-180                |
> +--------+---------------------+------------------------+

Cheers,

Jeff 
It is usually not acceptable to run applications with synchronous_commit=off, so once you have identified that the bottleneck is in implementing synchronous_commit=on, you probably need to take a deep dive into your hardware to figure out why it isn't performing the way you need/want/expect it to.  Tuning the server under synchronous_commit=off when you don't intend to run your production server with that setting is unlikely to be fruitful.

I do not intend to run the server with synchronous_commit=off, but based on my limited knowledge, I'm wondering if all these observations are somehow related and are caused by the same underlying bottleneck (or misconfiguration):

1) At higher concurrency levels, TPS for synchronous_commit=off is lower for optimised settings when compared to default settings
2) At ALL concurrency levels, TPS for synchronous_commit=on is lower for optimised settings (irrespective of shared_buffers value), compared to default settings
3) At higher concurrency levels, optimised + synchronous_commit=on + shared_buffers=2G has HIGHER TPS than optimised + synchronous_commit=off + shared_buffers=8G

Here are the (completely counter-intuitive) numbers for these observations:

+--------+-----------------------------------------------------------------+------------------------+
|        |                      synchronous_commit=on                      | synchronous_commit=off |
+--------+-----------------------------------------------------------------+------------------------+
| client | Mostly defaults [1] | Optimised [2]       | Optimised [2]       | Optimised [2]          |
|        |                     | + shared_buffers=2G | + shared_buffers=8G | + shared_buffers=8G    |
+--------+---------------------+---------------------+---------------------+------------------------+
| 1      | 80-86               | 74-77               | 75-75               | 169-180                |
+--------+---------------------+---------------------+---------------------+------------------------+
| 6      | 350-376             | 301-304             | 295-300             | 1265-1397              |
+--------+---------------------+---------------------+---------------------+------------------------+
| 12     | 603-619             | 476-488             | 485-493             | 1746-2352              |
+--------+---------------------+---------------------+---------------------+------------------------+
| 24     | 947-1015            | 678-739             | 723-770             | 1869-2518              |
+--------+---------------------+---------------------+---------------------+------------------------+
| 48     | 1435-1512           | 950-1043            | 1029-1086           | 1912-2818              |
+--------+---------------------+---------------------+---------------------+------------------------+
| 96     | 1769-1811           | 3337-3459           | 1302-1346           | 1546-1753              |
+--------+---------------------+---------------------+---------------------+------------------------+
| 192    | 1857-1992           | 3613-3715           | 1269-1345           | 1332-1508              |
+--------+---------------------+---------------------+---------------------+------------------------+
| 384    | 1667-1793           | 3180-3300           | 1262-1364           | 1356-1450              |
+--------+---------------------+---------------------+---------------------+------------------------+
 

In case you do intend to run with synchronous_commit=off, or if you are just curious:  running with a very high number of active connections often reveals subtle bottlenecks and interactions, and is very dependent on your hardware.  Unless you actually intend to run your server with synchronous_commit=off and with a large number of active connections, it is probably not worth investigating this.

Please see the table above. The reason I'm digging deeper into this is observation (2) above, i.e. I am unable to come up with any optimised settings that perform better than the default settings for the concurrency levels that I care about (100-150).
 
I'm more interested in the low end, you should do much better than those reported numbers when clients=1 and synchronous_commit=off with the data on SSD.  I think you said that pgbench is running on a different machine than the database, so perhaps it is just network overhead that is keeping this value down.  What happens if you run them on the same machine?

I'm currently running this, but the early numbers are surprising. For client=1, the numbers for optimised settings + shared_buffers=2G are:

-- pgbench run over a 1Gbps network: 74-77 tps
-- pgbench run on the same machine: 152-153 tps (is this absolute number good enough given my hardware?)

Is the 1 Gbps network the bottleneck? Does it explain the three observations given above? I'll wait for the current set of benchmarks to finish, then re-run them over the network and monitor network utilisation.
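
(A back-of-envelope check, using the numbers above and assuming the default TPC-B-like script's seven statements per transaction, i.e. roughly seven network round trips:)

    # measure the round-trip time between the two boxes
    ping -c 20 pg-server
    # at ~75 tps over the network a transaction takes ~13.3 ms; at ~152 tps
    # locally it takes ~6.6 ms.  The ~6.7 ms difference spread over ~7 round
    # trips is ~1 ms per round trip, so an RTT of around 1 ms would account
    # for most of the gap -- latency, not bandwidth.
    # watch the link while a run is in progress:
    sar -n DEV 5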

[1] "Mostly default" settings are whatever ships with Ubuntu 18.04 + PG 11. A snippet of the relevant setts are given below:

    max_connection=400
    work_mem=4MB
    maintenance_work_mem=64MB
    shared_buffers=128MB
    temp_buffers=8MB
    effective_cache_size=4GB
    wal_buffers=-1
    wal_sync_method=fsync
    max_wal_size=1GB
    autovacuum=off            # Auto-vacuuming was disabled


[2] Optimized settings

    max_connections = 400
    shared_buffers = 8GB           # or 2GB -- depending upon which scenario was being evaluated
    effective_cache_size = 24GB
    maintenance_work_mem = 2GB
    checkpoint_completion_target = 0.7
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 1.1
    effective_io_concurrency = 200
    work_mem = 3495kB
    min_wal_size = 1GB
    max_wal_size = 2GB
    max_worker_processes = 12
    max_parallel_workers_per_gather = 6
    max_parallel_workers = 12
    autovacuum=off            # Auto-vacuuming was disabled
 
Here's the previous table again -- trying to prevent the wrapping.

+--------+------------------------------------------------+-----------------+
|        |              synchronous_commit=on             | sync_commit=off |
+--------+------------------------------------------------+-----------------+
| client | Defaults [1] | Optimised [2]  | Optimised [2]  | Optimised [2]   |
|        |              | (buffers=2G)   | (buffers=8G)   | (buffers=8G)    |
+--------+--------------+----------------+----------------+-----------------+
| 1      | 80-86        | 74-77          | 75-75          | 169-180         |
+--------+--------------+----------------+----------------+-----------------+
| 6      | 350-376      | 301-304        | 295-300        | 1265-1397       |
+--------+--------------+----------------+----------------+-----------------+
| 12     | 603-619      | 476-488        | 485-493        | 1746-2352       |
+--------+--------------+----------------+----------------+-----------------+
| 24     | 947-1015     | 678-739        | 723-770        | 1869-2518       |
+--------+--------------+----------------+----------------+-----------------+
| 48     | 1435-1512    | 950-1043       | 1029-1086      | 1912-2818       |
+--------+--------------+----------------+----------------+-----------------+
| 96     | 1769-1811    | 3337-3459      | 1302-1346      | 1546-1753       |
+--------+--------------+----------------+----------------+-----------------+
| 192    | 1857-1992    | 3613-3715      | 1269-1345      | 1332-1508       |
+--------+--------------+----------------+----------------+-----------------+
| 384    | 1667-1793    | 3180-3300      | 1262-1364      | 1356-1450       |
+--------+--------------+----------------+----------------+-----------------+
All this benchmarking has led me to a philosophical question: why does PG need shared_buffers in the first place? What's wrong with letting the OS do the caching/buffering? Isn't it optimised for this kind of stuff?
An update. It seems (to my untrained eye) that something is wrong with the second SSD in the RAID configuration. Here's my question on serverfault related to what I saw with iostat - https://serverfault.com/questions/951096/difference-in-utilisation-reported-by-iostat-for-two-identical-disks-in-raid1

I've disabled RAID and rebooted the server to run the benchmarks with client=1,4,8,12 with shared_buffers=8MB (default) vs shared_buffers=2GB (optimised?) and will report back.

-- Saurabh.

Le 28/01/2019 à 15:03, Saurabh Nanda a écrit :
> An update. It seems (to my untrained eye) that something is wrong with 
> the second SSD in the RAID configuration. Here's my question on 
> serverfault related to what I saw with iostat - 
>
https://serverfault.com/questions/951096/difference-in-utilisation-reported-by-iostat-for-two-identical-disks-in-raid1
>
> I've disabled RAID and rebooted the server to run the benchmarks with 
> client=1,4,8,12 with shared_buffers=8MB (default) vs 
> shared_buffers=2GB (optimised?) and will report back.
>
>
You should probably include the detailed hardware you are working on - 
especially for the SSD, the model can have a big impact, as well as its 
wear.


Nicolas




You should probably include the detailed hardware you are working on -
especially for the SSD, the model can have a big impact, as well as its
wear.

What's the best tool to get meaningful information for SSD drives?

-- Saurabh.
Le 28/01/2019 à 16:55, Saurabh Nanda a écrit :
You should probably include the detailed hardware you are working on -
especially for the SSD, the model can have a big impact, as well as its
wear.

What's the best tool to get meaningful information for SSD drives?

smartctl is a good start

Nicolas

smartctl is a good start

Here's the output of `smartctl --xall /dev/sda` -- https://gist.github.com/saurabhnanda/ec3c95c1eb3896b3efe55181e7c78dde

I've disabled RAID, so /dev/sda is the only disk currently being used.

I'm still seeing very weird numbers. There seems to be absolutely no difference in performance if I increase shared_buffers from 8MB to 2GB. Is there some other setting which needs to be changed to take advantage of the increase in shared_buffers? Also, something even weirder is happening for client=1.
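
(One thing worth checking is whether the larger cache is even being exercised -- a sketch using the standard statistics views, with the database name as a placeholder:)

    # reset the counters, run pgbench, then compare hits vs reads
    psql -U postgres -d bench -c "SELECT pg_stat_reset();"
    # ... pgbench run ...
    psql -U postgres -d bench -c "
      SELECT blks_hit, blks_read,
             round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
      FROM pg_stat_database WHERE datname = current_database();"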


+--------+--------------+----------------+
| client | Defaults [1] | buffers=2G [2] |
+--------+--------------+----------------+
| 1      | 348-475 (??) | 529-583 (??)   |
+--------+--------------+----------------+
| 4      | 436-452      | 451-452        |
+--------+--------------+----------------+
| 8      | 862-869      | 859-861        |
+--------+--------------+----------------+
| 12     | 1210-1219    | 1220-1225      |
+--------+--------------+----------------+

[1] Default settings
     checkpoint_completion_target=0.5 
     default_statistics_target=100 
     effective_io_concurrency=1 
     max_parallel_workers=8 
     max_parallel_workers_per_gather=2 
     max_wal_size=1024 MB
     max_worker_processes=20 
     min_wal_size=80 MB
     random_page_cost=4 
     shared_buffers=1024 8kB
     wal_buffers=32 8kB
     work_mem=4096 kB

[2] Increased shared_buffers
     checkpoint_completion_target=0.5 
     default_statistics_target=100 
     effective_io_concurrency=1 
     max_parallel_workers=8 
     max_parallel_workers_per_gather=2 
     max_wal_size=1024 MB
     max_worker_processes=20 
     min_wal_size=80 MB
     random_page_cost=4 
     shared_buffers=262144 8kB
     wal_buffers=2048 8kB
     work_mem=4096 kB

-- Saurabh.
I've disabled transparent huge pages and enabled huge_pages as given below. Let's see what happens. (I feel like a monkey pressing random buttons trying to turn a light bulb on... and I'm sure the monkey would've had it easier!)

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:    5000
HugePages_Free:     4954
HugePages_Rsvd:     1015
HugePages_Surp:        0
Hugepagesize:       2048 kB
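
(For completeness, the sequence I understand is needed for huge_pages=on is roughly the following; the page count comes from the postmaster's peak memory, and the data-directory path is the Ubuntu default:)

    # peak shared memory of the running postmaster, per the PG docs' method
    sudo grep ^VmPeak /proc/$(sudo head -1 /var/lib/postgresql/11/main/postmaster.pid)/status
    # divide VmPeak by the 2 MB huge page size and reserve a few extra pages
    sudo sysctl -w vm.nr_hugepages=5000     # persist via /etc/sysctl.conf
    # then set huge_pages = on in postgresql.conf and restart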

-- Saurabh.
Do you know which of the settings is causing lower TPS ?

I suggest to check shared_buffers.

If you haven't done it, disabling THP and KSM can resolve performance issues,
esp. with large RAM like shared_buffers, at least with older kernels.
https://www.postgresql.org/message-id/20170718180152.GE17566%40telsasoft.com

I've tried reducing the number of variables to a bare minimum and have the following three cases now with RAID disabled:

a) only default settings [1]
b) default settings with shared_buffers=2G [2]
c) default settings with shared_buffers=2G & huge_pages=on [3]

The numbers are still not making any sense whatsoever.

+--------+--------------+----------------+---------------+
| client | Defaults [1] | buffers=2G [2] | buffers=2G    |
|        |              |                | huge_pages=on |
+--------+--------------+----------------+---------------+
| 1      | 348-475 (??) | 529-583 (??)   | 155-290       |
+--------+--------------+----------------+---------------+
| 4      | 436-452      | 451-452        | 388-403       |
+--------+--------------+----------------+---------------+
| 8      | 862-869      | 859-861        | 778-781       |
+--------+--------------+----------------+---------------+
| 12     | 1210-1219    | 1220-1225      | 1110-1111     |
+--------+--------------+----------------+---------------+


[1] Default settings
     checkpoint_completion_target=0.5 
     default_statistics_target=100 
     effective_io_concurrency=1 
     max_parallel_workers=8 
     max_parallel_workers_per_gather=2 
     max_wal_size=1024 MB
     max_worker_processes=20 
     min_wal_size=80 MB
     random_page_cost=4 
     shared_buffers=1024 8kB
     wal_buffers=32 8kB
     work_mem=4096 kB

[2] Increased shared_buffers
     checkpoint_completion_target=0.5 
     default_statistics_target=100 
     effective_io_concurrency=1 
     max_parallel_workers=8 
     max_parallel_workers_per_gather=2 
     max_wal_size=1024 MB
     max_worker_processes=20 
     min_wal_size=80 MB
     random_page_cost=4 
     shared_buffers=262144 8kB
     wal_buffers=2048 8kB
     work_mem=4096 kB

[3] Same settings as [2] with huge_pages=on and the following changes:

     $ cat /sys/kernel/mm/transparent_hugepage/enabled 
     always madvise [never]

     $ cat /proc/meminfo |grep -i huge
     AnonHugePages:         0 kB
     ShmemHugePages:        0 kB
     HugePages_Total:    5000
     HugePages_Free:     3940
     HugePages_Rsvd:        1
     HugePages_Surp:        0
     Hugepagesize:       2048 kB

-- Saurabh.
     
Yet another update:

a) I've tried everything with my EX41-SSD server on Hetzner and nothing is increasing the performance over & above the default configuration.
b) I tried commissioning a new EX41-SSD server and was able to replicate the same pathetic performance numbers. 
c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water.

This leads me to believe that my assumption that the first server's hardware (or SSD) was faulty is incorrect. Something is wrong with either the EX41-SSD hardware or the out-of-box configuration. I'm commissioning something from their PX line (which is marked as "Datacenter Edition") and checking if that makes things better.

+--------+--------------+------------------+
| client | Hetzner      | E2E Networks     |
|        | EX41-SSD [1] | Cloud Server [2] |
+--------+--------------+------------------+
| 1      | ~160         | ~400             |
+--------+--------------+------------------+
| 4      | ~460         | ~1450            |
+--------+--------------+------------------+
| 8      | ~850         | ~2600            |
+--------+--------------+------------------+
| 12     | ~1200        | ~4000            |
+--------+--------------+------------------+

[2] lshw output for E2E - https://gist.github.com/saurabhnanda/d276603990aa773269bad35f335344eb - since this is a cloud server, low-level hardware info is not available. It's advertised as a 9vCPU + 30GB RAM + SSD cloud instance.

-- Saurabh.


c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water.

I noticed that on E2E, the root filesystem is mounted with the following options:

    /dev/xvda on / type ext4 
    (rw,noatime,nodiratime,nobarrier,errors=remount-ro,stripe=512,data=ordered)

whereas on Hetzner, it is mounted with the following options:

    /dev/nvme0n1p3 on / type ext4
    (rw,relatime,data=ordered)

How much of a difference can this have on absolute TPS numbers?

-- Saurabh.

Le 29/01/2019 à 07:15, Saurabh Nanda a écrit :

c) I tried another cloud hosting provider (E2E Networks) and just the raw performance numbers (with default configuration) are blowing Hetzner out of the water.

I noticed that on E2E, the root filesystem is mounted with the following options:

    /dev/xvda on / type ext4 
    (rw,noatime,nodiratime,nobarrier,errors=remount-ro,stripe=512,data=ordered)

whereas on Hetzner, it is mounted with the following options:

    /dev/nvme0n1p3 on / type ext4
    (rw,relatime,data=ordered)

How much of a difference can this have on absolute TPS numbers?


Differences can be significant. noatime does not update the inode access time at all, while relatime still updates the access time when it is older than the modification/change time (which can often be the case for a database).

nobarrier disables block-level write barriers. Barriers ensure that data is actually stored on the device; the mount man page says: "If disabled on a device with a volatile (non-battery-backed) write-back cache, the nobarrier option will lead to filesystem corruption on a system crash or power loss."

You should probably consider noatime instead of relatime, and whether nobarrier is safe depends on whether you have a battery-backed cache or not.

Also, this is an SSD, so you should TRIM it, either with a periodic fstrim or by using the discard mount option.
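
(Concretely, something like the following -- noatime is safe to try; nobarrier is not, unless the write cache is battery/capacitor backed:)

    # remount the data filesystem with noatime for the duration of a test
    sudo mount -o remount,noatime /
    # (make it permanent by adding noatime to the options column in /etc/fstab)
    # one-off TRIM of free blocks, and the periodic timer shipped with Ubuntu 18.04
    sudo fstrim -v /
    systemctl status fstrim.timer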


Nicolas





On Mon, Jan 28, 2019 at 12:03 AM Saurabh Nanda <saurabhnanda@gmail.com> wrote:
All this benchmarking has led me to a philosophical question: why does PG need shared_buffers in the first place?

PostgreSQL cannot let the OS get its hands on a dirty shared buffer until the WAL record "protecting" that buffer has been flushed to disk. If a dirty shared buffer got written to disk, but then a crash happened before the WAL record got flushed to disk, then the data could be corrupted when it comes back up. So shared_buffers effectively serves as a cooling pond where dirty buffers wait for their WAL to be flushed naturally, so they can be written without instigating a performance-reducing flush just for them.

Also, concurrent clients need to access the same disk pages at overlapping times without corrupting each other.  Perhaps that could be implemented by keeping just the buffer headers in shared memory to coordinate the locking, without having the buffers themselves in shared memory.  But that is not how it is currently implemented.
 
What's wrong with letting the OS do the caching/buffering? 

Nothing, and that is what it does.  Which is why the advice for shared_buffers is often to use a small fraction of RAM, leaving the rest for the OS to do its thing.  But PostgreSQL still needs a way to lock those pages, both against concurrent access by its own clients, and against getting flushed out of order by the OS.  There is no performant way to release the dirty pages immediately to the OS while still constraining the order in which the OS flushes them to disk. 

Finally, while reading a page from the OS cache into shared_buffers is much faster than reading it from disk, it is still much slower than finding it already located in shared_buffers.  So if your entire database fits in RAM, you will get better performance if shared_buffers is large enough for the entire thing to fit in there, as well.  This is an exception to the rule that shared_buffers should be a small fraction of RAM.
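
(A quick way to check whether the whole benchmark database would fit -- the database name is a placeholder:)

    psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('bench'));"
    psql -U postgres -c "SHOW shared_buffers;"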
 
Isn't it optimised for this kind of stuff?

Maybe.  But you might be surprised at how poorly optimized it is.  It depends on your OS and version of it, of course.  If you have a high usage_count buffer which is re-dirtied constantly, it will only get written and flushed to disk once per checkpoint if under PostgreSQL control. But I've seen pages like that get written many times per second under kernel control. Whatever optimization it tried to do, it wasn't very good at it.  Also, if many contiguous pages are dirtied in a close time-frame, but not dirtied in their physical order, the kernel should be able to re-order them into long sequential writes, correct?  But empirically, it doesn't, at least back in the late 2.* series kernels when I did the experiments.  I don't know if it didn't even try, or tried but failed.  (Of course, back then PostgreSQL didn't do a good job of it either.)
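
(Those usage counts and dirty flags can be inspected with the pg_buffercache contrib extension, if one wants to dig further -- a sketch, assuming the contrib package is installed and the database name is a placeholder:)

    psql -U postgres -d bench -c "CREATE EXTENSION IF NOT EXISTS pg_buffercache;"
    psql -U postgres -d bench -c "
      SELECT usagecount, isdirty, count(*) AS buffers
      FROM pg_buffercache
      GROUP BY usagecount, isdirty
      ORDER BY usagecount, isdirty;"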

Cheers,

Jeff