Thread: [PERFORM] Very poor read performance, query independent

[PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:

I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:

*2x Intel Xeon E5550

*72GB RAM

*Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80% read/20% write) for Postgresql data only:

Logical Drive: 3

Size: 273.4 GB

Fault Tolerance: 1+0

Heads: 255

Sectors Per Track: 32

Cylinders: 65535

Strip Size: 128 KB

Full Stripe Size: 256 KB

Status: OK

Caching: Enabled

Unique Identifier: 600508B1001037383941424344450A00

Disk Name: /dev/sdc

Mount Points: /mnt/data 273.4 GB

OS Status: LOCKED

Logical Drive Label: A00A194750123456789ABCDE516F

Mirror Group 0:

physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)

physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)

Mirror Group 1:

physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)

physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 146 GB, OK)

Drive Type: Data

Formatted with ext4 with: sudo mkfs.ext4 -E stride=32,stripe_width=64 -v /dev/sdc1.

Mounted in /etc/fstab with this line: "UUID=99fef4ae-51dc-4365-9210-0b153b1cbbd0 /mnt/data ext4 rw,nodiratime,user_xattr,noatime,nobarrier,errors=remount-ro 0 1"
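
As a sanity check, the stride/stripe_width values follow from the array geometry above (128 KB strips, two data-bearing disks in a four-disk RAID10, 4 kB ext4 blocks). A minimal sketch of the arithmetic:

# Figures taken from the controller output above; 4 kB is the ext4 block size.
BLOCK_KB=4
STRIP_KB=128
DATA_DISKS=2
STRIDE=$((STRIP_KB / BLOCK_KB))        # 128 / 4 = 32
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))  # 32 * 2 = 64
echo "stride=${STRIDE} stripe_width=${STRIPE_WIDTH}"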

Postgresql is the only application running on this server.


Postgresql is used as a mini data warehouse to generate reports and do statistical analysis. It is used by at most 2 users and fresh data is added every 10 days. The database has 16 tables: one is 224GB and the rest are between 16kB and 470MB.


My configuration is:


name | current_setting | source

---------------------------------+------------------------------------------------+----------------------

application_name | psql | client

autovacuum_vacuum_scale_factor | 0 | configuration file

autovacuum_vacuum_threshold | 2000 | configuration file

checkpoint_completion_target | 0.9 | configuration file

checkpoint_timeout | 30min | configuration file

client_encoding | UTF8 | client

client_min_messages | log | configuration file

cluster_name | 9.6/main | configuration file

cpu_index_tuple_cost | 0.001 | configuration file

cpu_operator_cost | 0.0005 | configuration file

cpu_tuple_cost | 0.003 | configuration file

DateStyle | ISO, YMD | configuration file

default_statistics_target | 100 | configuration file

default_text_search_config | pg_catalog.english | configuration file

dynamic_shared_memory_type | posix | configuration file

effective_cache_size | 22GB | configuration file

effective_io_concurrency | 4 | configuration file

external_pid_file | /var/run/postgresql/9.6-main.pid | configuration file

lc_messages | C | configuration file

lc_monetary | en_CA.UTF-8 | configuration file

lc_numeric | en_CA.UTF-8 | configuration file

lc_time | en_CA.UTF-8 | configuration file

listen_addresses | * | configuration file

lock_timeout | 100s | configuration file

log_autovacuum_min_duration | 0 | configuration file

log_checkpoints | on | configuration file

log_connections | on | configuration file

log_destination | csvlog | configuration file

log_directory | /mnt/bigzilla/data/toburn/hp/postgresql/pg_log | configuration file

log_disconnections | on | configuration file

log_error_verbosity | default | configuration file

log_file_mode | 0600 | configuration file

log_filename | postgresql-%Y-%m-%d_%H%M%S.log | configuration file

log_line_prefix | user=%u,db=%d,app=%aclient=%h | configuration file

log_lock_waits | on | configuration file

log_min_duration_statement | 0 | configuration file

log_min_error_statement | debug1 | configuration file

log_min_messages | debug1 | configuration file

log_rotation_size | 1GB | configuration file

log_temp_files | 0 | configuration file

log_timezone | localtime | configuration file

logging_collector | on | configuration file

maintenance_work_mem | 3GB | configuration file

max_connections | 10 | configuration file

max_locks_per_transaction | 256 | configuration file

max_parallel_workers_per_gather | 14 | configuration file

max_stack_depth | 2MB | environment variable

max_wal_size | 4GB | configuration file

max_worker_processes | 14 | configuration file

min_wal_size | 2GB | configuration file

parallel_setup_cost | 1000 | configuration file

parallel_tuple_cost | 0.012 | configuration file

port | 5432 | configuration file

random_page_cost | 22 | configuration file

seq_page_cost | 1 | configuration file

shared_buffers | 34GB | configuration file

shared_preload_libraries | pg_stat_statements | configuration file

ssl | on | configuration file

ssl_cert_file | /etc/ssl/certs/ssl-cert-snakeoil.pem | configuration file

ssl_key_file | /etc/ssl/private/ssl-cert-snakeoil.key | configuration file

statement_timeout | 1000000s | configuration file

stats_temp_directory | /var/run/postgresql/9.6-main.pg_stat_tmp | configuration file

superuser_reserved_connections | 1 | configuration file

syslog_facility | local1 | configuration file

syslog_ident | postgres | configuration file

syslog_sequence_numbers | on | configuration file

temp_file_limit | 80GB | configuration file

TimeZone | localtime | configuration file

track_activities | on | configuration file

track_counts | on | configuration file

track_functions | all | configuration file

unix_socket_directories | /var/run/postgresql | configuration file

vacuum_cost_delay | 1ms | configuration file

vacuum_cost_limit | 5000 | configuration file

vacuum_cost_page_dirty | 200 | configuration file

vacuum_cost_page_hit | 10 | configuration file

vacuum_cost_page_miss | 100 | configuration file

wal_buffers | 16MB | configuration file

wal_compression | on | configuration file

wal_sync_method | fdatasync | configuration file

work_mem | 1468006kB | configuration file


The part of /etc/sysctl.conf I modified is:

vm.swappiness = 1

vm.dirty_background_bytes = 134217728

vm.dirty_bytes = 1073741824

vm.overcommit_ratio = 100

vm.zone_reclaim_mode = 0

kernel.numa_balancing = 0

kernel.sched_autogroup_enabled = 0

kernel.sched_migration_cost_ns = 5000000


The problem I have is very poor read performance. When I benchmark my array with fio I get random reads of about 200MB/s and 1100 IOPS and sequential reads of about 286MB/s and 21000 IOPS. But when I watch my queries using pg_activity, I get at best 4MB/s. Also, using dstat I can see that iowait time is at about 25%. This problem is not query-dependent.

I backed up the database, reformatted the array making sure it is well aligned, then restored the database and got the same result.

Where should I target my troubleshooting at this stage? I reformatted my drive, and I tuned my postgresql.conf and OS as much as I could. The hardware doesn’t seem to have any issues; I am really puzzled.

Thanks!


Charles


--
Charles Nadeau Ph.D.

Re: [PERFORM] Very poor read performance, query independent

From
Rick Otten
Date:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).
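
Something along these lines, for example (a sketch only; the huge page count is sized to roughly cover the 34GB shared_buffers with 2MB pages plus headroom, and 256 for effective_io_concurrency is just a value to try):

# Reserve ~35GB of 2MB huge pages (illustrative figure; check the postmaster's VmPeak as per the docs).
sudo sysctl -w vm.nr_hugepages=17900
echo 'vm.nr_hugepages = 17900' | sudo tee -a /etc/sysctl.conf

# In postgresql.conf:
#   huge_pages = try                 # needs a restart
#   effective_io_concurrency = 256   # a reload is enough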



Re: [PERFORM] Very poor read performance, query independent

From
Andreas Kretschmer
Date:

Am 10.07.2017 um 16:03 schrieb Charles Nadeau:
> random_page_cost | 22


why such a high value for random_page_cost?

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Andreas,

Because the ratio between the Sequential IOPS and Random IOPS is about 29. Taking into account that part of the data is in RAM, I obtained an "effective" ratio of about 22.
Thanks!

Charles

On Mon, Jul 10, 2017 at 5:35 PM, Andreas Kretschmer <andreas@a-kretschmer.de> wrote:


Am 10.07.2017 um 16:03 schrieb Charles Nadeau:
random_page_cost | 22


why such a high value for random_page_cost?

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com




Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


Considering your 72 GB of RAM and a database size of ~225GB, and also the fact that Postgres is the only app running on the server, probably 1/3 of your database resides in memory, so random_page_cost = 22 looks extremely high; it probably precludes index usage in your queries altogether.

 

You should try this setting at least at its default value, random_page_cost = 4, and probably go even lower.

Also, effective_cache_size should be at least as big as your shared_buffers. With 72GB of RAM, effective_cache_size should be set around 64GB (again considering that Postgres is the only app running on the server).
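
For example (a sketch; it assumes you can connect as a superuser, and both settings take effect on a reload, no restart needed):

psql -c "ALTER SYSTEM SET random_page_cost = 4;"
psql -c "ALTER SYSTEM SET effective_cache_size = '64GB';"
psql -c "SELECT pg_reload_conf();"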

 

Regards,

Igor Neyman

 

 

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Mon, Jul 10, 2017 at 7:03 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:


The problem I have is very poor read. When I benchmark my array with fio I get random reads of about 200MB/s and 1100IOPS and sequential reads of about 286MB/s and 21000IPS.



That doesn't seem right.  Sequential is only 43% faster?  What job file are you giving to fio?

What do you get if you do something simpler, like: 

time cat ~/$PGDATA/base/16402/*|wc -c

replacing 16402 with whatever your biggest database is.
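
If you are not sure which OID that is, something like this (just a sketch) lists the databases by size:

psql -c "SELECT oid, datname, pg_size_pretty(pg_database_size(oid)) FROM pg_database ORDER BY pg_database_size(oid) DESC;"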

Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Rick,

I applied the change you recommended but it didn't speed up the reads.
One thing I forgot to mention earlier is that the speed of the backup made with COPY operations seems almost normal: I see read speeds of up to 85MB/s.
Thanks for your help!

Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

I reduced the value of random_page_cost to 4 but the read speed remains low.
Regarding effective_cache_size and shared_buffers, do you mean they should both be set to 64GB?
Thanks for suggestions!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:
charles@hpdl380g6:~$ sudo sh -c 'time cat /mnt/data/postgresql/base/16385/* | wc -c'
[sudo] password for charles: 
1.62user 179.94system 29:50.79elapsed 10%CPU (0avgtext+0avgdata 1920maxresident)k
448026264inputs+0outputs (0major+117minor)pagefaults 0swaps
241297594904

After making the huge pages changes suggested by Rick Otten (above), I got slightly better results (135.7MB/s):
charles@hpdl380g6:~$ sudo sh -c 'time cat /mnt/data/postgresql/base/16385/* | wc -c'
[sudo] password for charles: 
0.86user 130.84system 28:15.78elapsed 7%CPU (0avgtext+0avgdata 1820maxresident)k
471286792inputs+0outputs (1major+118minor)pagefaults 0swaps
241297594904

Could you suggest another way to benchmark random reads?

Thanks for your help!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
After reducing random_page_cost to 4 and testing more, I can report that the aggregate read throughput for a parallel sequential scan is about 90MB/s. However, the throughput for a non-parallel sequential scan is still around 4MB/s.

One more question: if a query uses more than one table, can more than one table be read through a parallel sequential scan? I have many queries joining  tables and I noticed that there was never more than one table read in parallel.
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


 

No, they should not be equal.

From the docs:

 

effective_cache_size (integer)

Sets the planner's assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used. When setting this parameter you should consider both PostgreSQL's shared buffers and the portion of the kernel's disk cache that will be used for PostgreSQL data files. Also, take into account the expected number of concurrent queries on different tables, since they will have to share the available space. This parameter has no effect on the size of shared memory allocated by PostgreSQL, nor does it reserve kernel disk cache; it is used only for estimation purposes. The system also does not assume data remains in the disk cache between queries. The default is 4 gigabytes (4GB).

So, I’d set shared_buffers at 24GB and effective_cache_size at 64GB.
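
For example (a sketch; shared_buffers, unlike effective_cache_size, only takes effect after a full restart, and the service name below assumes the stock Ubuntu packaging of your 9.6/main cluster):

psql -c "ALTER SYSTEM SET shared_buffers = '24GB';"
psql -c "ALTER SYSTEM SET effective_cache_size = '64GB';"
sudo systemctl restart postgresql@9.6-main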

 

Regards,

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


 

Also, maybe it’s time to look at execution plans (explain analyze) of specific slow queries, instead of trying to solve the problem “in general”.

 

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

The sum of effective_cache_size and shared_buffers will be higher than the physical memory I have. Is that OK?
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


 

Yes, that’s normal.

 

shared_buffers is the maximum amount of memory that Postgres is allowed to allocate for its buffer cache, while effective_cache_size is just a number the optimizer takes into account when creating an execution plan.

 

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Merlin Moncure
Date:
On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic).
> Hardware is:
>
> *2x Intel Xeon E5550
>
> *72GB RAM
>
> *Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80%
> read/20% write) for Postgresql data only:
>
> The problem I have is very poor read. When I benchmark my array with fio I
> get random reads of about 200MB/s and 1100IOPS and sequential reads of about
> 286MB/s and 21000IPS. But when I watch my queries using pg_activity, I get
> at best 4MB/s. Also using dstat I can see that iowait time is at about 25%.
> This problem is not query-dependent.

Stop right there.     1100 iops * 8kb = ~8mb/sec raw which might
reasonably translate to 4mb/sec to the client. 200mb/sec random
read/sec on spinning media is simply not plausible; your benchmark is
lying to you.   Random reads on spinning media are absolutely going to
be storage bound.

merlin


Re: [PERFORM] Very poor read performance, query independent

From
"Joshua D. Drake"
Date:
On 07/11/2017 04:15 PM, Merlin Moncure wrote:
> On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau
> <charles.nadeau@gmail.com> wrote:
>> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic).
>> Hardware is:
>>
>> *2x Intel Xeon E5550
>>
>> *72GB RAM
>>
>> *Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80%
>> read/20% write) for Postgresql data only:
>>
>> The problem I have is very poor read. When I benchmark my array with fio I
>> get random reads of about 200MB/s and 1100IOPS and sequential reads of about
>> 286MB/s and 21000IPS. But when I watch my queries using pg_activity, I get
>> at best 4MB/s. Also using dstat I can see that iowait time is at about 25%.
>> This problem is not query-dependent.
>
> Stop right there.     1100 iops * 8kb = ~8mb/sec raw which might
> reasonably translate to 4mb/sec to the client. 200mb/sec random
> read/sec on spinning media is simply not plausible;

Sure it is, if he had more than 4 disks ;) but he also isn't going to
get 1100 IOPS from 4 10k disks. The average 10k disk is going to get
around 130 IOPS. If he only has 4, then there is no way he is getting
1100 IOPS.

Using the above specs (4x146GB), the best he can reasonably hope for from
the drives themselves is about 50MB/s. Add in the 1GB FBWC and that is
how he is getting those high IOPS numbers, but that is because of
caching.

He may need to adjust his readahead as well as his kernel scheduler. At
a minimum he should be able to saturate the drives without issue.

JD



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****


Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

I don't think any of those are directly relevant to PostgreSQL, as it doesn't use direct IO, doesn't use libaio, and is rarely going to get anywhere near 256 iodepth.  So the best they can do is put a theoretical ceiling on the performance.  Also, random IO with a 4MB stride doesn't make any sense from a PostgreSQL perspective.
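
Something closer to what a single backend does would be buffered, synchronous 8kB random reads, for example (a sketch, not a tuned job file):

fio --name=pglike --ioengine=sync --direct=0 --bs=8k --iodepth=1 --size=10G --readwrite=randread --filename=test --runtime=60 --time_based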

 

Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:

I would have expected it to do better than that.  Maybe you should increase the kernel readahead setting.  I've found the default to be much too small.  But it doesn't make much difference to you, as you appear to be doing random IO in your queries, not sequential.
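
For example (a sketch; 4096 sectors = 2MB is only an illustrative value):

sudo blockdev --getra /dev/sdc        # current readahead, in 512-byte sectors
sudo blockdev --setra 4096 /dev/sdc   # raise it to 2MB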


Could you suggest another way to benchmark random reads?

Your 1100 IOPS times 8kB block size gives about 8MB/s of throughput, which is close to what you report.  So I would instead focus on tuning your actual queries.  You say the problem is not query-dependent, but I think that just means all the queries you looked at are similar.  If you looked at a query that can't use indexes, like count(unindexed_column) from biggest_table, you would find it doing much more IO than 4MB/s.

Can you pick the simplest query you actually care about, and post both an "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it?  (Preferably turning "track_io_timing" on first).

One other question I had: you said you had "2x Intel Xeon E5550", which should be 8 CPUs (or 16, if the hyperthreads are reported as separate CPUs).  But you also said: "Also using dstat I can see that iowait time is at about 25%".  Usually if there is only one thing going on on the server, iowait won't be more than the reciprocal of the number of CPUs.  Is the server busy doing other stuff at the same time you are benchmarking it?

Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Tue, Jul 11, 2017 at 4:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 07/11/2017 04:15 PM, Merlin Moncure wrote:
On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic).
Hardware is:

*2x Intel Xeon E5550

*72GB RAM

*Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80%
read/20% write) for Postgresql data only:

The problem I have is very poor read. When I benchmark my array with fio I
get random reads of about 200MB/s and 1100IOPS and sequential reads of about
286MB/s and 21000IPS. But when I watch my queries using pg_activity, I get
at best 4MB/s. Also using dstat I can see that iowait time is at about 25%.
This problem is not query-dependent.

Stop right there.     1100 iops * 8kb = ~8mb/sec raw which might
reasonably translate to 4mb/sec to the client. 200mb/sec random
read/sec on spinning media is simply not plausible;

Sure it is, if he had more than 4 disks ;)

Or more to the point here, if each random read is 4MB long, which makes it more like sequential reads done piecewise at random offsets rather than true random reads.
 
but he also isn't going to get 1100 IOPS from 4 10k disks. The average 10k disk is going to get around 130 IOPS . If he only has 4 then there is no way he is getting 1100 IOPS.

I wouldn't be sure.  He is using an iodepth of 256 in his benchmark.  It wouldn't be all that outrageous for a disk to be able to find 3 or 4 sectors per revolution it can read, when it has that many to choose from.

 Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Hmm - how are you measuring that sequential scan speed of 4MB/s? I'd
recommend doing a very simple test, e.g. here's one on my workstation -
13 GB single table on 1 SATA drive - cold cache after reboot, sequential
scan using Postgres 9.6.2:

bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                      QUERY PLAN
------------------------------------------------------------------------------------
  Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
    ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00
rows=100000000 width=0)
(2 rows)


bench=#  SELECT pg_relation_size('pgbench_accounts');
  pg_relation_size
------------------
       13429514240
(1 row)

bench=# SELECT count(*) FROM pgbench_accounts;
    count
-----------
  100000000
(1 row)

Time: 118884.277 ms


So doing the math, the seq read speed is about 110MB/s (i.e. 13 GB in 120
sec). Sure enough, while I was running the query iostat showed:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  926.00    0.00   114.89     0.00   254.10     1.90    2.03    2.03    0.00   1.08 100.00


So it might be useful for us to see something like that from your system -
note you need to check you have really flushed the cache, and that no
other apps are using the db.
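
One way to make sure the cache really is cold before such a test (a sketch; it drops the whole OS page cache, and you would also restart Postgres if you want shared_buffers cold):

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches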

regards

Mark

On 12/07/17 00:46, Charles Nadeau wrote:
> After reducing random_page_cost to 4 and testing more, I can report
> that the aggregate read throughput for parallel sequential scan is
> about 90MB/s. However the throughput for sequential scan is still
> around 4MB/s.
>



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

I set shared_buffers to 24GB and effective_cache_size to 64GB, and I can see that the queries are faster because the indexes are used more often. Knowing I have 72GB of RAM and that the server is exclusively dedicated to Postgresql, what could be the maximum value for effective_cache_size?
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Joshua,

I use noop as the scheduler because it is better to let the RAID controller re-arrange the IO operations before they reach the disks. Read-ahead is set to 128:
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/read_ahead_kb
128
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Jeff,

Here are the two EXPLAINs for one of my simplest queries:

flows=# SET track_io_timing = on;
LOG:  duration: 24.101 ms  statement: SET track_io_timing = on;
SET
flows=# explain (analyze, timing off) SELECT DISTINCT
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport,
flows-#    COUNT(*) AS conversation,
flows-#    SUM(doctets) / 1024 / 1024 AS mbytes 
flows-# FROM
flows-#    flowscompact,
flows-#    mynetworks 
flows-# WHERE
flows-#    mynetworks.ipaddr >>= flowscompact.srcaddr 
flows-#    AND dstaddr IN 
flows-#    (
flows(#       SELECT
flows(#          dstaddr 
flows(#       FROM
flows(#          dstexterne
flows(#    )
flows-# GROUP BY
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport 
flows-# ORDER BY
flows-#    mbytes DESC LIMIT 50;
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.3", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.4", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.5", size 639696896
LOG:  duration: 2632108.352 ms  statement: explain (analyze, timing off) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes 
FROM
   flowscompact,
   mynetworks 
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr 
   AND dstaddr IN 
   (
      SELECT
         dstaddr 
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport 
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                                  QUERY PLAN                                                                                                   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual rows=50 loops=1)
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual rows=50 loops=1)
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual rows=45499045 loops=10)
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual rows=54155970 loops=10)
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 48679396
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual rows=2 loops=5415597040)
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual rows=4 loops=10)
 Planning time: 62.066 ms
 Execution time: 2631923.716 ms
(33 rows)

flows=# explain (analyze, buffers) SELECT DISTINCT
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport,
flows-#    COUNT(*) AS conversation,
flows-#    SUM(doctets) / 1024 / 1024 AS mbytes 
flows-# FROM
flows-#    flowscompact,
flows-#    mynetworks 
flows-# WHERE
flows-#    mynetworks.ipaddr >>= flowscompact.srcaddr 
flows-#    AND dstaddr IN 
flows-#    (
flows(#       SELECT
flows(#          dstaddr 
flows(#       FROM
flows(#          dstexterne
flows(#    )
flows-# GROUP BY
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport 
flows-# ORDER BY
flows-#    mbytes DESC LIMIT 50;
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG:  duration: 2765020.327 ms  statement: explain (analyze, buffers) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes 
FROM
   flowscompact,
   mynetworks 
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr 
   AND dstaddr IN 
   (
      SELECT
         dstaddr 
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport 
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                                              QUERY PLAN                                                                                                              
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
   I/O Timings: read=5323746.860
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
         I/O Timings: read=5323746.860
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
               I/O Timings: read=5323746.860
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                     I/O Timings: read=5323746.860
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                           I/O Timings: read=5323746.860
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 Buffers: shared hit=1116590559 read=15851133
                                 I/O Timings: read=5323746.860
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=1116588309 read=15851133
                                       I/O Timings: read=5323746.860
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
                                             Buffers: shared hit=996551813 read=15851133
                                             I/O Timings: read=5323746.860
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
                                                   Buffers: shared hit=1634 read=15851133
                                                   I/O Timings: read=5323746.860
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 59971474
                                                   Buffers: shared hit=996550152
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             Buffers: shared hit=120036496
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=120036496
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         Buffers: shared hit=120036496
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                               Buffers: shared hit=120036459
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
                                                               Buffers: shared hit=10
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
                                                                     Buffers: shared hit=10
 Planning time: 6.689 ms
 Execution time: 2764860.853 ms
(58 rows)

Regarding "Also using dstat I can see that iowait time is at about 25%", I don't think the server was doing anything else. If it is important, I can repeat the benchmarks.
Thanks!

Charles

On Wed, Jul 12, 2017 at 2:39 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

I don't think any of those are directly relevant to PostgreSQL, as it doesn't use direct IO, doesn't use libaio, and is rarely going to get anywhere near 256 iodepth.  So the best they can do is put a theoretical ceiling on the performance.  Also, random IO with a 4MB stride doesn't make any sense from a PostgreSQL perspective.
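For comparison, a rough sketch of a fio run that is closer to PostgreSQL's access pattern (buffered, synchronous 8kB random reads with a single outstanding request; the file name, size, and runtime here are arbitrary) would be:

# Buffered 8kB random reads, one request at a time, roughly what a single
# PostgreSQL backend issues when it does random I/O.
fio --name=pglike --filename=test --size=4G --ioengine=psync --direct=0 --bs=8k --iodepth=1 --readwrite=randread --runtime=60 --time_based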

 

Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:

I would have expected it to do better than that.  Maybe you should increase the kernel readahead setting; I've found the default to be much too small.  But it doesn't make much difference to you, as you appear to be doing random IO in your queries, not sequential.
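For example (a sketch; /dev/sdc is the data volume from your description, and the 4MB value is just a starting point to experiment with):

# Current read-ahead, in 512-byte sectors
sudo blockdev --getra /dev/sdc
# Raise it to 8192 sectors (4MB), i.e. read_ahead_kb = 4096
sudo blockdev --setra 8192 /dev/sdc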


Could you suggest another way to benchmark random reads?

Your 1100 IOPS times 8kB block size gives about 8MB/s of throughput, which is close to what you report.  So I would instead focus on tuning your actual queries.  You say the problem is not query-dependent, but I think that just means all the queries you looked at are similar.  If you looked at a query that can't use indexes, like count(unindexed_column) from biggest_table, you would find it doing much more IO than 4MB/s.

Can you pick the simplest query you actually care about, and post both an "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it?  (Preferably turning "track_io_timing" on first).
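Something like this, in one psql session (the count(*) on the big flows table is only an example; substitute the query you care about):

SET track_io_timing = on;
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM flows;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM flows;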

One other question I had: you said you had "2x Intel Xeon E5550", which should be 8 CPUs (or 16, if the hyperthreads are reported as separate CPUs).  But you also said: "Also using dstat I can see that iowait time is at about 25%".  Usually if there is only one thing going on on the server, then IOWAIT won't be more than the reciprocal of the number of CPUs.  Is the server busy doing other stuff at the same time you are benchmarking it?

Cheers,

Jeff



--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Rick,

Should the number of pages always be correlated to the VmPeak of the postmaster, or could it be set to reflect shared_buffers or another setting?
Thanks!

Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).






--

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Wednesday, July 12, 2017 3:21 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

Igor,

 

I set shared_buffers to 24GB and effective_cache_size to 64GB and I can see that the queries are faster because the indexes are used more often. Knowing I have 72GB of RAM and the server is exclusively dedicated to Postgresql, what could be the maximum value for effective_cache_size?

Thanks!

 

Charles

 

64GB for effective_cache_size should be good enough; adding a couple more GB wouldn’t change much.

 

Igor

Re: [PERFORM] Very poor read performance, query independent

From
Rick Otten
Date:


On Wed, Jul 12, 2017 at 9:38 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Rick,

Should the number of pages always be correlated to the VmPeak of the postmaster, or could it be set to reflect shared_buffers or another setting?
Thanks!


The documentation implies that you may need to adjust its size when you change the shared_buffers setting. 

I usually check it every now and then (I haven't built a formal monitor yet) to see how many of the huge pages are free/used.  If it looks like they are all getting consumed, consider bumping it higher; if there are lots free, you are probably fine.

cat /proc/meminfo | grep -i "^huge"
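A sketch of the sizing arithmetic the documentation describes, using the external_pid_file from your configuration (the final nr_hugepages value below is only an illustration, not a measurement from your server):

# PID of the postmaster
head -1 /var/run/postgresql/9.6-main.pid
# Peak virtual memory of that process, in kB
grep ^VmPeak /proc/$(head -1 /var/run/postgresql/9.6-main.pid)/status
# Huge page size, in kB (typically 2048)
grep ^Hugepagesize /proc/meminfo
# vm.nr_hugepages should be at least VmPeak / Hugepagesize, e.g.:
sudo sysctl -w vm.nr_hugepages=18000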

--

Also regarding my note on effective_io_concurrency, which I'm not sure you tried tweaking yet.

With file system and hardware caching between you and your spindles, your best setting for effective_io_concurrency may be much higher than the actual number of spindles.  It is worth experimenting with.  If you can, try several values.  You can use pgbench to put consistent workloads on your database for measurement purposes.
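A sketch of one iteration of that experiment (the value, the pgbench options, and the "bench" database initialized earlier with "pgbench -i" are all just examples):

-- in psql, try one candidate value (repeat with e.g. 4, 32, 64, 128, 256)
ALTER SYSTEM SET effective_io_concurrency = 64;
SELECT pg_reload_conf();

# then drive a repeatable, read-only workload and compare tps/latency
pgbench -S -c 8 -j 8 -T 300 bench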


Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).

Re: [PERFORM] Very poor read performance, query independent

From
bricklen
Date:
On Wed, Jul 12, 2017 at 12:30 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:

I use noop as the scheduler because it is better to let the RAID controller re-arrange the IO operations before they reach the disks. Read-ahead is set to 128kB:
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/read_ahead_kb
128
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 


Perhaps pg_test_fsync (https://www.postgresql.org/docs/9.6/static/pgtestfsync.html) and pg_test_timing will help shed some light here, or at the very least give some numbers to compare against.
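For example (a sketch; the test file is put on the data array so the numbers reflect the P410i volume):

# fsync behaviour of the data volume, 5 seconds per test
pg_test_fsync -f /mnt/data/pg_test_fsync.out -s 5
# timer overhead, relevant if you turn on track_io_timing
pg_test_timing -d 3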

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


Charles,

 

In your original posting I couldn’t find what value you set for temp_buffers.

Considering you have plenty of RAM, try setting temp_buffers=’6GB’ and then run ‘explain (analyze, buffers) select…’ in the same session. This should alleviate the “disk sort” problem.

 

Also, could you post the structure of the flowscompact, mynetworks, and dstexterne tables with all the indexes and number of rows? Actually, are they all tables, or are some of them views?

 

Igor

 

 

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 

 


 

 

Sorry, I misstated the parameter to change.

It is work_mem (not temp_buffers) you should try to increase to 6GB.

 

Igor

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Wed, Jul 12, 2017 at 3:04 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Jeff,

Here are the 2 EXPLAINs for one of my simplest query:


It looks like dstexterne and flowscompact are both views over flows.  Can you share the definition of those views?

I think the iowait > 12.5% is due to the parallel query execution.  But then the question is, why is it only 25% when you have 10-fold parallelism?

It certainly looks like you are doing more than 4MB/s there, so maybe something is wrong with the instrumentation, or how you are interpreting it.  

Although it is still less than it perhaps could do.  To put a baseline on what you can expect out of parallel seq scans, can you do something like:

explain (analyze, buffers) select avg(doctets) from flows;

Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

First I must say that I changed my disk configuration from 4 disks in RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during the last ingest of data.
Here is the result test you asked. It was done with a cold cache:
flows=# \timing
Timing is on.
flows=# explain select count(*) from flows;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
   ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
         Workers Planned: 1
         ->  Partial Aggregate  (cost=17213914.07..17213914.07 rows=1 width=8)
               ->  Parallel Seq Scan on flows  (cost=0.00..17019464.49 rows=388899162 width=0)
(5 rows)

Time: 171.835 ms
flows=# select pg_relation_size('flows');
 pg_relation_size 
------------------
     129865867264
(1 row)

Time: 57.157 ms
flows=# select count(*) from flows;
LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
   count   
-----------
 589831190
(1 row)

Time: 625546.662 ms
The throughput reported by Postgresql is almost 198MB/s, and the throughput as measured by dstat during the query execution was between 25 and 299MB/s. It is much better than what I had before! The I/O wait was about 12% all through the query. One thing I noticed is the discrepancy between the read throughput reported by pg_activity and the one reported by dstat: pg_activity always reports a value lower than dstat.
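For reference, that figure is simply the relation size divided by the runtime; a quick sanity check in psql using the numbers above:

SELECT 129865867264 / 1024.0 / 1024 / 625.546 AS mib_per_sec;
-- ≈ 198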

Besides the change of disk configuration, here is what contributed the most to the improvement of the performance so far (summarized as a configuration fragment after this list):
Using huge pages
Increasing effective_io_concurrency to 256
Reducing random_page_cost from 22 to 4
Reducing min_parallel_relation_size to 512kB to have more workers when doing parallel sequential scans of my biggest table
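Collected as a postgresql.conf fragment (just a summary of the changes listed above for this workload, not a general recommendation; the huge pages change also needs vm.nr_hugepages on the kernel side):

huge_pages = try                     # kernel side configured via vm.nr_hugepages
effective_io_concurrency = 256
random_page_cost = 4
min_parallel_relation_size = 512kB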

Thanks for recommending this test, I now know what the real throughput should be!

Charles

On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Hmm - how are you measuring that sequential scan speed of 4MB/s? I'd recommend doing a very simple test, e.g., here's one on my workstation - 13 GB single table on 1 SATA drive - cold cache after reboot, sequential scan using Postgres 9.6.2:

bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
   ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00 rows=100000000 width=0)
(2 rows)


bench=#  SELECT pg_relation_size('pgbench_accounts');
 pg_relation_size
------------------
      13429514240
(1 row)

bench=# SELECT count(*) FROM pgbench_accounts;
   count
-----------
 100000000
(1 row)

Time: 118884.277 ms


So doing the math, seq read speed is about 110MB/s (i.e. 13 GB in 120 sec). Sure enough, while I was running the query iostat showed:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  926.00    0.00 114.89     0.00   254.10     1.90    2.03    2.03    0.00   1.08 100.00


So it might be useful for us to see something like that from your system - note you need to check you really have flushed the cache, and that no other apps are using the db.

regards

Mark


On 12/07/17 00:46, Charles Nadeau wrote:
After reducing random_page_cost to 4 and testing more, I can report that the aggregate read throughput for parallel sequential scan is about 90MB/s. However, the throughput for sequential scan is still around 4MB/s.





--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

Initially temp_buffers was left at its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?
Here is the explain you requested with work_mem set to 6GB:
flows=# set work_mem='6GB';
SET
flows=# explain (analyze, buffers) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes
FROM
   flowscompact,
   mynetworks
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr
   AND dstaddr IN
   (
      SELECT
         dstaddr
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                           QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)
   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)
         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 654395kB
               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 3049216kB
                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)
                                 Workers Planned: 12
                                 Workers Launched: 12
                                 Buffers: shared hit=728798037 read=82974833
                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=728795193 read=82974833
                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)
                                             Buffers: shared hit=590692229 read=14991777
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)
                                                   Buffers: shared hit=860990 read=14991777
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 0
                                                   Buffers: shared hit=589831203
                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB
                                             Buffers: shared hit=138102964 read=67983056
                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=138102964 read=67983056
                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 503353617
                                                         Buffers: shared hit=138102964 read=67983056
                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)
                                                               Buffers: shared hit=138102915 read=67983056
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)
                                                               Buffers: shared hit=13
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)
                                                                     Buffers: shared hit=13
 Planning time: 0.941 ms
 Execution time: 2228345.171 ms
(48 rows)

With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query. 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (having about 590 million rows), keeping only the most often used fields.
flows=# \d+ flowscompact;
                        View "public.flowscompact"
  Column   |           Type           | Modifiers | Storage | Description 
-----------+--------------------------+-----------+---------+-------------
 flow_id   | bigint                   |           | plain   | 
 sysuptime | bigint                   |           | plain   | 
 exaddr    | ip4                      |           | plain   | 
 dpkts     | integer                  |           | plain   | 
 doctets   | bigint                   |           | plain   | 
 first     | bigint                   |           | plain   | 
 last      | bigint                   |           | plain   | 
 srcaddr   | ip4                      |           | plain   | 
 dstaddr   | ip4                      |           | plain   | 
 srcport   | integer                  |           | plain   | 
 dstport   | integer                  |           | plain   | 
 prot      | smallint                 |           | plain   | 
 tos       | smallint                 |           | plain   | 
 tcp_flags | smallint                 |           | plain   | 
 timestamp | timestamp with time zone |           | plain   | 
View definition:
 SELECT flowstimestamp.flow_id,
    flowstimestamp.sysuptime,
    flowstimestamp.exaddr,
    flowstimestamp.dpkts,
    flowstimestamp.doctets,
    flowstimestamp.first,
    flowstimestamp.last,
    flowstimestamp.srcaddr,
    flowstimestamp.dstaddr,
    flowstimestamp.srcport,
    flowstimestamp.dstport,
    flowstimestamp.prot,
    flowstimestamp.tos,
    flowstimestamp.tcp_flags,
    flowstimestamp."timestamp"
   FROM flowstimestamp;
mynetworks is a table having one column and 4 rows; it contains a list of our networks:
flows=# select * from mynetworks;
     ipaddr     
----------------
(4 rows)
flows=# \d+ mynetworks;
                    Table "public.mynetworks"
 Column | Type | Modifiers | Storage | Stats target | Description 
--------+------+-----------+---------+--------------+-------------
 ipaddr | ip4r |           | plain   |              | 
Indexes:
    "mynetworks_ipaddr_idx" gist (ipaddr)
dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.
flows=# \d+ dstexterne;
              View "public.dstexterne"
 Column  | Type | Modifiers | Storage | Description 
---------+------+-----------+---------+-------------
 dstaddr | ip4  |           | plain   | 
View definition:
 SELECT DISTINCT flowscompact.dstaddr
   FROM flowscompact
     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r
  WHERE mynetworks.ipaddr IS NULL;
Thanks!

Charles

On Wed, Jul 12, 2017 at 6:39 PM, Igor Neyman <ineyman@perceptron.com> wrote:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Charles Nadeau
Sent: Wednesday, July 12, 2017 6:05 AM
To: Jeff Janes <jeff.janes@gmail.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

 

flows=# explain (analyze, buffers) SELECT DISTINCT

flows-#    srcaddr,

flows-#    dstaddr,

flows-#    dstport,

flows-#    COUNT(*) AS conversation,

flows-#    SUM(doctets) / 1024 / 1024 AS mbytes 

flows-# FROM

flows-#    flowscompact,

flows-#    mynetworks 

flows-# WHERE

flows-#    mynetworks.ipaddr >>= flowscompact.srcaddr 

flows-#    AND dstaddr IN 

flows-#    (

flows(#       SELECT

flows(#          dstaddr 

flows(#       FROM

flows(#          dstexterne

flows(#    )

flows-# GROUP BY

flows-#    srcaddr,

flows-#    dstaddr,

flows-#    dstport 

flows-# ORDER BY

flows-#    mbytes DESC LIMIT 50;

LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824

LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824

LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896

LOG:  duration: 2765020.327 ms  statement: explain (analyze, buffers) SELECT DISTINCT

   srcaddr,

   dstaddr,

   dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes 

FROM

   flowscompact,

   mynetworks 

WHERE

   mynetworks.ipaddr >>= flowscompact.srcaddr 

   AND dstaddr IN 

   (

      SELECT

         dstaddr 

      FROM

         dstexterne

   )

GROUP BY

   srcaddr,

   dstaddr,

   dstport 

ORDER BY

   mbytes DESC LIMIT 50;

                                                                                                              QUERY PLAN                                                                                                              

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)

   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

   I/O Timings: read=5323746.860

   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)

         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

         I/O Timings: read=5323746.860

         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)

               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))

               Sort Method: quicksort  Memory: 563150kB

               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

               I/O Timings: read=5323746.860

               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)

                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport

                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

                     I/O Timings: read=5323746.860

                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)

                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport

                           Sort Method: external merge  Disk: 2721856kB

                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

                           I/O Timings: read=5323746.860

                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)

                                 Workers Planned: 9

                                 Workers Launched: 9

                                 Buffers: shared hit=1116590559 read=15851133

                                 I/O Timings: read=5323746.860

                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)

                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)

                                       Buffers: shared hit=1116588309 read=15851133

                                       I/O Timings: read=5323746.860

                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)

                                             Buffers: shared hit=996551813 read=15851133

                                             I/O Timings: read=5323746.860

                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)

                                                   Buffers: shared hit=1634 read=15851133

                                                   I/O Timings: read=5323746.860

                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)

                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)

                                                   Heap Fetches: 59971474

                                                   Buffers: shared hit=996550152

                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)

                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB

                                             Buffers: shared hit=120036496

                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)

                                                   Group Key: flows_1.dstaddr

                                                   Buffers: shared hit=120036496

                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)

                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)

                                                         Rows Removed by Join Filter: 453681377

                                                         Buffers: shared hit=120036496

                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)

                                                               Heap Fetches: 91

                                                               Buffers: shared hit=120036459

                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)

                                                               Buffers: shared hit=10

                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)

                                                                     Buffers: shared hit=10

 Planning time: 6.689 ms

 Execution time: 2764860.853 ms

(58 rows)

 

Regarding "Also using dstat I can see that iowait time is at about 25%", I don't think the server was doing anything else. If it is important, I can repeat the benchmarks.

Thanks!

 

Charles

 

Charles,

 

In your original posting I couldn’t find what value you set for temp_buffers.

Considering you have plenty of RAM, try setting temp_buffers='6GB' and then run 'explain (analyze, buffers) select…' in the same session. This should alleviate the 'disk sort' problem.

 

Also, could you post the structure of the flowscompact, mynetworks, and dstexterne tables, with all their indexes and row counts? Actually, are they all tables, or are some of them views?

 

Igor

 

 

Sorry, I misstated the parameter to change.

It is work_mem (not temp_buffers) you should try to increase to 6GB.

 

Igor

 

 




--

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

Igor,

 

Initially temp_buffers was left at its default value (8MB). Watching the contents of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?

Here is the explain you requested with work_mem set to 6GB:

flows=# set work_mem='6GB';

SET

flows=# explain (analyze, buffers) SELECT DISTINCT

   srcaddr,

   dstaddr,

   dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes

FROM

   flowscompact,

   mynetworks

WHERE

   mynetworks.ipaddr >>= flowscompact.srcaddr

   AND dstaddr IN

   (

      SELECT

         dstaddr

      FROM

         dstexterne

   )

GROUP BY

   srcaddr,

   dstaddr,

   dstport

ORDER BY

   mbytes DESC LIMIT 50;

                                                                                           QUERY PLAN                                                                                           

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)

   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)

         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)

               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))

               Sort Method: quicksort  Memory: 654395kB

               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)

                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport

                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)

                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport

                           Sort Method: external merge  Disk: 3049216kB

                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)

                                 Workers Planned: 12

                                 Workers Launched: 12

                                 Buffers: shared hit=728798037 read=82974833

                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)

                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)

                                       Buffers: shared hit=728795193 read=82974833

                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)

                                             Buffers: shared hit=590692229 read=14991777

                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)

                                                   Buffers: shared hit=860990 read=14991777

                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)

                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)

                                                   Heap Fetches: 0

                                                   Buffers: shared hit=589831203

                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)

                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB

                                             Buffers: shared hit=138102964 read=67983056

                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)

                                                   Group Key: flows_1.dstaddr

                                                   Buffers: shared hit=138102964 read=67983056

                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)

                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)

                                                         Rows Removed by Join Filter: 503353617

                                                         Buffers: shared hit=138102964 read=67983056

                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)

                                                               Buffers: shared hit=138102915 read=67983056

                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)

                                                               Buffers: shared hit=13

                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)

                                                                     Buffers: shared hit=13

 Planning time: 0.941 ms

 Execution time: 2228345.171 ms

(48 rows)

 

With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query.

 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (about 590 million rows); it keeps only the most often used fields.

flows=# \d+ flowscompact;

                        View "public.flowscompact"

  Column   |           Type           | Modifiers | Storage | Description 

-----------+--------------------------+-----------+---------+-------------

 flow_id   | bigint                   |           | plain   | 

 sysuptime | bigint                   |           | plain   | 

 exaddr    | ip4                      |           | plain   | 

 dpkts     | integer                  |           | plain   | 

 doctets   | bigint                   |           | plain   | 

 first     | bigint                   |           | plain   | 

 last      | bigint                   |           | plain   | 

 srcaddr   | ip4                      |           | plain   | 

 dstaddr   | ip4                      |           | plain   | 

 srcport   | integer                  |           | plain   | 

 dstport   | integer                  |           | plain   | 

 prot      | smallint                 |           | plain   | 

 tos       | smallint                 |           | plain   | 

 tcp_flags | smallint                 |           | plain   | 

 timestamp | timestamp with time zone |           | plain   | 

View definition:

 SELECT flowstimestamp.flow_id,

    flowstimestamp.sysuptime,

    flowstimestamp.exaddr,

    flowstimestamp.dpkts,

    flowstimestamp.doctets,

    flowstimestamp.first,

    flowstimestamp.last,

    flowstimestamp.srcaddr,

    flowstimestamp.dstaddr,

    flowstimestamp.srcport,

    flowstimestamp.dstport,

    flowstimestamp.prot,

    flowstimestamp.tos,

    flowstimestamp.tcp_flags,

    flowstimestamp."timestamp"

   FROM flowstimestamp;

mynetworks is a table having one column and 4 rows; it contains a list of our networks:

flows=# select * from mynetworks;

     ipaddr     

----------------

(4 rows)

flows=# \d+ mynetworks;

                    Table "public.mynetworks"

 Column | Type | Modifiers | Storage | Stats target | Description 

--------+------+-----------+---------+--------------+-------------

 ipaddr | ip4r |           | plain   |              | 

Indexes:

    "mynetworks_ipaddr_idx" gist (ipaddr)

dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.

flows=# \d+ dstexterne;

              View "public.dstexterne"

 Column  | Type | Modifiers | Storage | Description 

---------+------+-----------+---------+-------------

 dstaddr | ip4  |           | plain   | 

View definition:

 SELECT DISTINCT flowscompact.dstaddr

   FROM flowscompact

     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r

  WHERE mynetworks.ipaddr IS NULL;

Thanks!

 

Charles

 

Charles,

 

Don’t change temp_buffers.

Try to increase work_mem even more, say work_mem='12GB', because it's still using disk for sorting (starting around the 20th minute, as you noticed).

See if this:

“Sort Method: external merge  Disk: 3049216kB”

goes away.
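For illustration, a minimal self-contained sketch (not your query) of how the sort method reported by EXPLAIN changes with work_mem; the table-free generate_series query is just an example:

-- Small work_mem: the sort spills to disk.
SET work_mem = '1MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;
-- expect something like:  Sort Method: external merge  Disk: ...

-- Larger work_mem: the same sort stays in memory.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;
-- expect something like:  Sort Method: quicksort  Memory: ...

RESET work_mem;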

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Friday, July 14, 2017 3:13 PM
To: Charles Nadeau <charles.nadeau@gmail.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 


Charles,

 

Also, let’s try to simplify your query and see if it performs better.

You are grouping by srcaddr, dstaddr, dstport, which makes DISTINCT unnecessary (see the small example after the rewritten query below).

And after simplifying the WHERE clause (let me know if the result is not what you want), the query looks like:

 

SELECT srcaddr, dstaddr, dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes

FROM flowscompact

WHERE srcaddr IN (SELECT ipaddr FROM mynetworks)

     AND dstaddr NOT IN (SELECT ipaddr FROM mynetworks)

GROUP BY srcaddr, dstaddr, dstport

ORDER BY mbytes DESC

LIMIT 50;
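To illustrate the DISTINCT point above with a tiny, self-contained example (made-up data, not your tables):

-- The GROUP BY keys are already unique per output row, so DISTINCT only adds
-- an extra pass over the result; both statements return the same two rows.
SELECT DISTINCT a, count(*) FROM (VALUES (1), (1), (2)) AS t(a) GROUP BY a;
SELECT          a, count(*) FROM (VALUES (1), (1), (2)) AS t(a) GROUP BY a;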

 

Now, you didn’t provide the definition of the flowstimestamp table.

If this table doesn’t have an index on (srcaddr, dstaddr, dstport), creating one should help (I think).
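A sketch of what that index could look like (the index name is illustrative, and since flowscompact is a view, the index has to be created on the underlying table, assumed here to be flows):

-- CONCURRENTLY avoids blocking writes on flows while the index is built.
CREATE INDEX CONCURRENTLY IF NOT EXISTS flows_srcaddr_dstaddr_dstport_idx
    ON flows (srcaddr, dstaddr, dstport);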

 

Igor

 

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Ah yes - that seems more sensible (but still slower than I would expect
for 5-disk RAID 0). You should be able to get something like 5 *
(single disk speed), i.e. about 500MB/s.

Might be worth increasing device read ahead (more than you have
already). Some of these so-called 'smart' RAID cards need to be hit over
the head before they will perform. E.g.: I believe you have it set to 128
- I'd try 4096 or even 16384 (in the past I've used those settings on
some extremely stupid cards that refused to max out their disks' known
speeds).

Also worth investigating is RAID stripe size - for DW work it makes
sense for it to be reasonably big (256K to 1M), which again will help
speed up sequential scans.

Cheers

Mark


On 15/07/17 02:09, Charles Nadeau wrote:
> Mark,
>
> First I must say that I changed my disks configuration from 4 disks in
> RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space
> during the last ingest of data.
> Here is the result of the test you asked for. It was done with a cold cache:
>
>     flows=# \timing
>     Timing is on.
>     flows=# explain select count(*) from flows;
>                                               QUERY PLAN
>     -----------------------------------------------------------------------------------------------
>      Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
>        ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
>              Workers Planned: 1
>              ->  Partial Aggregate  (cost=17213914.07..17213914.07
>     rows=1 width=8)
>                    ->  Parallel Seq Scan on flows
>      (cost=0.00..17019464.49 rows=388899162 width=0)
>     (5 rows)
>
>     Time: 171.835 ms
>     flows=# select pg_relation_size('flows');
>      pg_relation_size
>     ------------------
>          129865867264
>     (1 row)
>
>     Time: 57.157 ms
>     flows=# select count(*) from flows;
>     LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
>        count
>     -----------
>      589831190
>     (1 row)
>
>     Time: 625546.662 ms
>
> The throughput reported by Postgresql is almost 198MB/s, and the
> throughput as measured by dstat during the query execution was between
> 25 and 299MB/s. It is much better than what I had before! The I/O wait
> was about 12% all through the query. One thing I noticed is the
> discrepancy between the read throughput reported by pg_activity and
> the one reported by dstat: pg_activity always reports a value lower
> than dstat.
>
> Besides the change of disk configuration, here is what contributed
> the most to the improvement of the performance so far:
>
>     Using Hugepage
>     Increasing effective_io_concurrency to 256
>     Reducing random_page_cost from 22 to 4
>     Reducing min_parallel_relation_size to 512kB to have more workers
>     when doing sequential parallel scan of my biggest table
>
>
> Thanks for recommending this test, I now know what the real throughput
> should be!
>
> Charles
>
> On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
>
>     Hmm - how are you measuring that sequential scan speed of 4MB/s?
>     I'd recommend doing a very simple test e.g, here's one on my
>     workstation - 13 GB single table on 1 SATA drive - cold cache
>     after reboot, sequential scan using Postgres 9.6.2:
>
>     bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
>                                          QUERY PLAN
>     ------------------------------------------------------------------------------------
>      Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
>        ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00
>     rows=100000000 width=0)
>     (2 rows)
>
>
>     bench=#  SELECT pg_relation_size('pgbench_accounts');
>      pg_relation_size
>     ------------------
>           13429514240
>     (1 row)
>
>     bench=# SELECT count(*) FROM pgbench_accounts;
>        count
>     -----------
>      100000000
>     (1 row)
>
>     Time: 118884.277 ms
>
>
>     So doing the math seq read speed is about 110MB/s (i.e 13 GB in
>     120 sec). Sure enough, while I was running the query iostat showed:
>
>     Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s wMB/s
>     avgrq-sz avgqu-sz   await r_await w_await  svctm %util
>     sda               0.00     0.00  926.00    0.00 114.89  0.00
>      254.10     1.90    2.03    2.03    0.00   1.08 100.00
>
>
>     So might be useful for us to see something like that from your
>     system - note you need to check you really have flushed the cache,
>     and that no other apps are using the db.
>
>     regards
>
>     Mark
>
>
>     On 12/07/17 00:46, Charles Nadeau wrote:
>
>         After reducing random_page_cost to 4 and testing more, I can
>         report that the aggregate read throughput for parallel
>         sequential scan is about 90MB/s. However the throughput for
>         sequential scan is still around 4MB/s.
>
>
>
>
>
> --
> Charles Nadeau Ph.D.
> http://charlesnadeau.blogspot.com/




Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Thinking about this a bit more - if somewhat more blazing performance is
needed, then this could be achieved by losing the RAID card and
spinning disks altogether and buying one of the NVMe or SATA solid state
products, e.g.:

- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and
200K IOPS)

- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)


The Samsung needs an M.2 port on the mobo (but most should have 'em -
and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit
more expensive compared to the Samsung, and is slower but has a longer
lifetime. However for your workload the Sammy is probably fine.

regards

Mark

On 15/07/17 11:09, Mark Kirkwood wrote:
> Ah yes - that seems more sensible (but still slower than I would
> expect for 5 disks RAID 0).



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

The server is an HP DL 380 G6. It doesn't really work with SATA drives. And when you find one that is compatible, it is only used at 3Gb/s with a maximum of 50000 IOPS (a well-known characteristic of the HP P410i SAS RAID controller). I am looking at getting a Kingston Digital HyperX Predator that I could use in one of the PCIe 2.0 x4 slots. However, I am worried about "thermal runaway", i.e. when the server can't get a temperature reading from a PCIe card, it spins the fans at full speed to protect the server against high temperature. With the machine being next to my desk, I worry about the deafening noise it will create.
Thanks!

Charles

On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Thinking about this a bit more - if somewhat more blazing performance is needed, then this could be achieved via losing the RAID card and spinning disks altogether and buying 1 of the NVME or SATA solid state products: e.g

- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and 200K IOPS)

- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)


The Samsung needs an M.2 port on the mobo (but most should have 'em - and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit more expensive compared to the Samsung, and is slower but has a longer lifetime. However for your workload the Sammy is probably fine.

regards

Mark


On 15/07/17 11:09, Mark Kirkwood wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0).






--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

I increased the read ahead to 16384 and it doesn't improve performance. My RAID 0 uses a stripe size of 256k, the maximum size supported by the controller.
Thanks!

Charles

On Sat, Jul 15, 2017 at 1:02 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5-disk RAID 0). You should be able to get something like 5 * (single disk speed), i.e. about 500MB/s.

Might be worth increasing device read ahead (more than you have already). Some of these so-called 'smart' RAID cards need to be hit over the head before they will perform. E.g.: I believe you have it set to 128 - I'd try 4096 or even 16384 (in the past I've used those settings on some extremely stupid cards that refused to max out their disks' known speeds).

Also worth investigating is RAID stripe size - for DW work it makes sense for it to be reasonably big (256K to 1M), which again will help speed up sequential scans.

Cheers

Mark

P.S. I used to work for Greenplum, so this type of problem came up a lot :-). The best cards were the LSI and Areca!



On 15/07/17 02:09, Charles Nadeau wrote:
Mark,

First I must say that I changed my disks configuration from 4 disks in RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during the last ingest of data.
Here is the result of the test you asked for. It was done with a cold cache:

    flows=# \timing
    Timing is on.
    flows=# explain select count(*) from flows;
                                              QUERY PLAN
    -----------------------------------------------------------------------------------------------
     Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
       ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
             Workers Planned: 1
             ->  Partial Aggregate  (cost=17213914.07..17213914.07
    rows=1 width=8)
                   ->  Parallel Seq Scan on flows
     (cost=0.00..17019464.49 rows=388899162 width=0)
    (5 rows)

    Time: 171.835 ms
    flows=# select pg_relation_size('flows');
     pg_relation_size
    ------------------
         129865867264
    (1 row)

    Time: 57.157 ms
    flows=# select count(*) from flows;
    LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
       count
    -----------
     589831190
    (1 row)

    Time: 625546.662 ms

The throughput reported by Postgresql is almost 198MB/s, and the throughput as measured by dstat during the query execution was between 25 and 299MB/s. It is much better than what I had before! The I/O wait was about 12% all through the query. One thing I noticed is the discrepancy between the read throughput reported by pg_activity and the one reported by dstat: pg_activity always reports a value lower than dstat.

Besides the change of disk configuration, here is what contributed the most to the improvement of the performance so far:

    Using Hugepage
    Increasing effective_io_concurrency to 256
    Reducing random_page_cost from 22 to 4
    Reducing min_parallel_relation_size to 512kB to have more workers
    when doing sequential parallel scan of my biggest table


Thanks for recommending this test, I now know what the real throughput should be!

Charles

On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>> wrote:

    Hmm - how are you measuring that sequential scan speed of 4MB/s?
    I'd recommend doing a very simple test e.g, here's one on my
    workstation - 13 GB single table on 1 SATA drive - cold cache
    after reboot, sequential scan using Postgres 9.6.2:

    bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                         QUERY PLAN
    ------------------------------------------------------------------------------------
     Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
       ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00
    rows=100000000 width=0)
    (2 rows)


    bench=#  SELECT pg_relation_size('pgbench_accounts');
     pg_relation_size
    ------------------
          13429514240
    (1 row)

    bench=# SELECT count(*) FROM pgbench_accounts;
       count
    -----------
     100000000
    (1 row)

    Time: 118884.277 ms


    So doing the math seq read speed is about 110MB/s (i.e 13 GB in
    120 sec). Sure enough, while I was running the query iostat showed:

    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s wMB/s
    avgrq-sz avgqu-sz   await r_await w_await  svctm %util
    sda               0.00     0.00  926.00    0.00 114.89  0.00      254.10     1.90    2.03    2.03    0.00   1.08 100.00


    So might be useful for us to see something like that from your
    system - note you need to check you really have flushed the cache,
    and that no other apps are using the db.

    regards

    Mark


    On 12/07/17 00:46, Charles Nadeau wrote:

        After reducing random_page_cost to 4 and testing more, I can
        report that the aggregate read throughput for parallel
        sequential scan is about 90MB/s. However the throughput for
        sequential scan is still around 4MB/s.





--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/




--

Re: [PERFORM] Very poor read performance, query independent

From
Scott Marlowe
Date:
On Sat, Jul 15, 2017 at 11:53 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Mark,
>
> I increased the read ahead to 16384 and it doesn't improve performance. My
> RAID 0 use a stripe size of 256k, the maximum size supported by the
> controller.

Are your queries still spilling to disk for sorts? If this is the
case, and they're just too big to fit in memory, then you need to move
your temp space, where sorts happen, onto another disk array that
isn't your poor overworked raid-10 array. Contention between sorts and
reads can kill performance quick, esp on spinning rust.
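A minimal sketch of one way to do that (the tablespace name and path are illustrative; the directory must already exist, be empty, and be owned by the postgres OS user):

-- Create a tablespace on the separate device and make it the default
-- location for temporary files and temporary tables.
CREATE TABLESPACE temp_space LOCATION '/mnt/tempdisk/pg_temp';
ALTER SYSTEM SET temp_tablespaces = 'temp_space';
SELECT pg_reload_conf();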


Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Right, that is a bit of a show stopper for those SSDs (the Intel needs
SATA 6Gb/s and the Sammys need PCIe 3.0 to perform to their rated specs).

regards

Mark


On 16/07/17 04:12, Charles Nadeau wrote:
> Mark,
>
> The server is a . It doesn't really work with SATA drives. And when
> you find one that is compatible, it is only used at 3Gb/s with a
> maximum of 50000 IOPS (a well-known characteristic of the HP P410i SAS
> RAID controller). I am looking at getting a Kingston Digital HyperX
> Predator that I could use in one of the PCIe 2.0 x4 slot. However I am
> worried about the "thermal runaway", i.e. when the server can't get a
> temperature reading from a PCIe card, it spins the fans at full speed
> to protect the server against high temperature. The machine being next
> to my desk I worry about the deafening noise it will create.
> Thanks!
>
> Charles
>
> On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
>
>     Thinking about this a bit more - if somewhat more blazing
>     performance is needed, then this could be achieved via losing the
>     RAID card and spinning disks altogether and buying 1 of the NVME
>     or SATA solid state products: e.g
>
>     - Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds
>     and 200K IOPS)
>
>     - Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)
>
>
>     The Samsung needs an M.2 port on the mobo (but most should have
>     'em - and if not PCIe X4 adapter cards are quite cheap). The Intel
>     is a bit more expensive compared to the Samsung, and is slower but
>     has a longer lifetime. However for your workload the Sammy is
>     probably fine.
>
>     regards
>
>     Mark
>
>     On 15/07/17 11:09, Mark Kirkwood wrote:
>
>         Ah yes - that seems more sensible (but still slower than I
>         would expect for 5 disks RAID 0).
>
>
>
>
>
>
>
>
> --
> Charles Nadeau Ph.D.
> http://charlesnadeau.blogspot.com/



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

I set work_mem to 12GB, restarted postgresql, repeated the same "explain (analyze, buffers)..." as above, and the read throughput was very low, at most 10MB/s. All the sorting operations are now done in memory.
I lowered work_mem back to 6GB, restarted postgresql, repeated the same "explain (analyze, buffers)..." as above, and the read throughput was very low, at most 10MB/s. The 1st sorting operation is still done in memory, the second one on disk. I think I need about 4GB to do all the sorts in memory.
One thing I remember from Friday's "explain (analyze, buffers)...": I had set temp_buffers and work_mem to 6GB, as I read both your messages one after the other. So I decided to try again: I set work_mem=6GB and temp_buffers=6GB, restarted postgresql, repeated the same "explain (analyze, buffers)..." as above, and the read throughput was very low again, at most 10MB/s. The 1st sorting operation is still done in memory, the second one on disk.
For the last test, I brought work_mem and temp_buffers back to their original values. The read throughput is still low.
In all cases, about 20 minutes after the query starts, it starts writing to disk furiously. Here are the peak values as reported by dstat:
work_mem=12GB, temp_buffers=8MB: peak of 393MB/s
work_mem=6GB, temp_buffers=8MB: peak of 579MB/s
work_mem=6GB, temp_buffers=6GB: peak of 418MB/s
work_mem=1468006kB, temp_buffers=8MB: peak of 61MB/s
Also, at peak write, the server almost ran out of memory: the cache almost went to 0 and it started swapping.

This query is a bit extreme in terms of sorting. Maybe I should try to benchmark while counting all the records of my biggest table like Mark Kirkwood suggested. I'll do some more tests and post the results back to the mailing list.
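As an aside, a minimal sketch of the per-session form used earlier in the thread, which avoids restarting the server between trials (the value is just an example):

SET work_mem = '12GB';   -- affects only the current session
SHOW work_mem;
-- run the EXPLAIN (ANALYZE, BUFFERS) query here, then:
RESET work_mem;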
Thanks!

Charles

On Fri, Jul 14, 2017 at 9:13 PM, Igor Neyman <ineyman@perceptron.com> wrote:


Charles,

 

Don’t change temp_buffers.

Try to increase work_mem even more, say work_mem=’12GB’, because it’s still using disk for sorting (starting around 20th minute as you noticed).

See if this:

“Sort Method: external merge  Disk: 3049216kB”

goes away.

Igor

 




--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Scott,

The temp tablespace is on a disk of its own.
Thanks!

Charles

On Sat, Jul 15, 2017 at 7:58 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Sat, Jul 15, 2017 at 11:53 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Mark,
>
> I increased the read ahead to 16384 and it doesn't improve performance. My
> RAID 0 use a stripe size of 256k, the maximum size supported by the
> controller.

Are your queries still spilling to disk for sorts? If this is the
case, and they're just too big to fit in memory, then you need to move
your temp space, where sorts happen, onto another disk array that
isn't your poor overworked raid-10 array. Contention between sorts and
reads can kill performance quick, esp on spinning rust.



--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

The first clause of the WHERE statement won't select addresses the same way as the one I wrote using the extension for IPv4 and IPv6 data types.

flowstimestamp is a view:
flows=# \d+ flowstimestamp 
                        View "public.flowstimestamp"
   Column    |           Type           | Modifiers | Storage | Description 
-------------+--------------------------+-----------+---------+-------------
 flow_id     | bigint                   |           | plain   | 
 unix_secs   | bigint                   |           | plain   | 
 unix_nsecs  | bigint                   |           | plain   | 
 sysuptime   | bigint                   |           | plain   | 
 exaddr      | ip4                      |           | plain   | 
 dpkts       | integer                  |           | plain   | 
 doctets     | bigint                   |           | plain   | 
 first       | bigint                   |           | plain   | 
 last        | bigint                   |           | plain   | 
 engine_type | smallint                 |           | plain   | 
 engine_id   | smallint                 |           | plain   | 
 srcaddr     | ip4                      |           | plain   | 
 dstaddr     | ip4                      |           | plain   | 
 nexthop     | ip4                      |           | plain   | 
 input       | integer                  |           | plain   | 
 output      | integer                  |           | plain   | 
 srcport     | integer                  |           | plain   | 
 dstport     | integer                  |           | plain   | 
 prot        | smallint                 |           | plain   | 
 tos         | smallint                 |           | plain   | 
 tcp_flags   | smallint                 |           | plain   | 
 src_mask    | smallint                 |           | plain   | 
 dst_mask    | smallint                 |           | plain   | 
 src_as      | integer                  |           | plain   | 
 dst_as      | integer                  |           | plain   | 
 timestamp   | timestamp with time zone |           | plain   | 
View definition:
 SELECT flows.flow_id,
    flows.unix_secs,
    flows.unix_nsecs,
    flows.sysuptime,
    flows.exaddr,
    flows.dpkts,
    flows.doctets,
    flows.first,
    flows.last,
    flows.engine_type,
    flows.engine_id,
    flows.srcaddr,
    flows.dstaddr,
    flows.nexthop,
    flows.input,
    flows.output,
    flows.srcport,
    flows.dstport,
    flows.prot,
    flows.tos,
    flows.tcp_flags,
    flows.src_mask,
    flows.dst_mask,
    flows.src_as,
    flows.dst_as,
    to_timestamp((flows.unix_secs + flows.unix_nsecs / 1000000000)::double precision) AS "timestamp"
   FROM flows;

And it can use the indexes of flows:
Indexes:
    "flows_pkey" PRIMARY KEY, btree (flow_id)
    "flows_dstaddr_dstport" btree (dstaddr, dstport)
    "flows_srcaddr_dstaddr_idx" btree (srcaddr, dstaddr)
    "flows_srcaddr_srcport" btree (srcaddr, srcport)
    "flows_srcport_dstport_idx" btree (srcport, dstport)

Thanks!

Charles

On Fri, Jul 14, 2017 at 10:18 PM, Igor Neyman <ineyman@perceptron.com> wrote:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Friday, July 14, 2017 3:13 PM
To: Charles Nadeau <charles.nadeau@gmail.com>


Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

Igor,

 

Initially temp_buffers was left at its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?

Here is the explain you requested with work_mem set to 6GB:

flows=# set work_mem='6GB';
SET
flows=# explain (analyze, buffers) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes
FROM
   flowscompact,
   mynetworks
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr
   AND dstaddr IN
   (
      SELECT
         dstaddr
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                           QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)
   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)
         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 654395kB
               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 3049216kB
                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)
                                 Workers Planned: 12
                                 Workers Launched: 12
                                 Buffers: shared hit=728798037 read=82974833
                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=728795193 read=82974833
                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)
                                             Buffers: shared hit=590692229 read=14991777
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)
                                                   Buffers: shared hit=860990 read=14991777
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 0
                                                   Buffers: shared hit=589831203
                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB
                                             Buffers: shared hit=138102964 read=67983056
                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=138102964 read=67983056
                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 503353617
                                                         Buffers: shared hit=138102964 read=67983056
                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)
                                                               Buffers: shared hit=138102915 read=67983056
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)
                                                               Buffers: shared hit=13
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)
                                                                     Buffers: shared hit=13
 Planning time: 0.941 ms
 Execution time: 2228345.171 ms
(48 rows)

 

With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query. 

 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (about 590 million rows); it only keeps the most often used fields.

flows=# \d+ flowscompact;
                        View "public.flowscompact"
  Column   |           Type           | Modifiers | Storage | Description 
-----------+--------------------------+-----------+---------+-------------
 flow_id   | bigint                   |           | plain   | 
 sysuptime | bigint                   |           | plain   | 
 exaddr    | ip4                      |           | plain   | 
 dpkts     | integer                  |           | plain   | 
 doctets   | bigint                   |           | plain   | 
 first     | bigint                   |           | plain   | 
 last      | bigint                   |           | plain   | 
 srcaddr   | ip4                      |           | plain   | 
 dstaddr   | ip4                      |           | plain   | 
 srcport   | integer                  |           | plain   | 
 dstport   | integer                  |           | plain   | 
 prot      | smallint                 |           | plain   | 
 tos       | smallint                 |           | plain   | 
 tcp_flags | smallint                 |           | plain   | 
 timestamp | timestamp with time zone |           | plain   | 
View definition:
 SELECT flowstimestamp.flow_id,
    flowstimestamp.sysuptime,
    flowstimestamp.exaddr,
    flowstimestamp.dpkts,
    flowstimestamp.doctets,
    flowstimestamp.first,
    flowstimestamp.last,
    flowstimestamp.srcaddr,
    flowstimestamp.dstaddr,
    flowstimestamp.srcport,
    flowstimestamp.dstport,
    flowstimestamp.prot,
    flowstimestamp.tos,
    flowstimestamp.tcp_flags,
    flowstimestamp."timestamp"
   FROM flowstimestamp;

mynetworks is a table having one column and 4 rows; it contains a list of our networks:

flows=# select * from mynetworks;
     ipaddr     
----------------
(4 rows)

flows=# \d+ mynetworks;
                    Table "public.mynetworks"
 Column | Type | Modifiers | Storage | Stats target | Description 
--------+------+-----------+---------+--------------+-------------
 ipaddr | ip4r |           | plain   |              | 
Indexes:
    "mynetworks_ipaddr_idx" gist (ipaddr)

dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.

flows=# \d+ dstexterne;
              View "public.dstexterne"
 Column  | Type | Modifiers | Storage | Description 
---------+------+-----------+---------+-------------
 dstaddr | ip4  |           | plain   | 
View definition:
 SELECT DISTINCT flowscompact.dstaddr
   FROM flowscompact
     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r
  WHERE mynetworks.ipaddr IS NULL;

Thanks!

 

Charles

 

Charles,

 

Also, let’s try to simplify your query and see if it performs better.

You are grouping by srcaddr, dstaddr, dstport, so DISTINCT is not needed.

And after simplifying the WHERE clause (let me know if the result is not what you want), the query looks like:

 

SELECT srcaddr, dstaddr, dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes
FROM flowscompact
WHERE srcaddr IN (SELECT ipaddr FROM mynetworks)
     AND dstaddr NOT IN (SELECT ipaddr FROM mynetworks)
GROUP BY srcaddr, dstaddr, dstport
ORDER BY mbytes DESC
LIMIT 50;

 

Now, you didn’t provide the definition of flowstimestamp table.

If this table doesn’t have an index on (srcaddr, dstaddr, dstport) creating one should help (I think).

 

Igor

 

 

 




--

Re: [PERFORM] Very poor read performance, query independent

From
Claudio Freire
Date:
On Fri, Jul 14, 2017 at 12:34 PM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
>                                  Workers Planned: 12
>                                  Workers Launched: 12
>                                  Buffers: shared hit=728798037 read=82974833
>                                  ->  Hash Semi Join
> (cost=30059688.07..47951761.31 rows=220376 width=20) (actual
> time=1268845.181..2007864.725 rows=7057357 loops=13)
>                                        Hash Cond: (flows.dstaddr =
> flows_1.dstaddr)
>                                        Buffers: shared hit=728795193
> read=82974833
>                                        ->  Nested Loop
> (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283
> rows=37910370 loops=13)
>                                              Buffers: shared hit=590692229
> read=14991777
>                                              ->  Parallel Seq Scan on flows
> (cost=0.00..16018049.14 rows=55094048 width=20) (actual
> time=0.152..566179.117 rows=45371630 loops=13)
>                                                    Buffers: shared
> hit=860990 read=14991777
>                                              ->  Index Only Scan using
> mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8)
> (actual time=0.002..0.002 rows=1 loops=589831190)
>                                                    Index Cond: (ipaddr >>=
> (flows.srcaddr)::ip4r)
>                                                    Heap Fetches: 0
>                                                    Buffers: shared
> hit=589831203

12 workers on a parallel sequential scan on a RAID-10 volume of
rotating disks may not be a good idea.

Have you measured average request size and average wait times with iostat?

Run "iostat -x -m -d 60" while running the query and copy a few
relevant lines (or attach the whole thing). I suspect 12 parallel
sequential scans are degrading your array's performance to random I/O
performance, and that explains the 10MB/s very well (a rotating disk
will give you about 3-4MB/s at random I/O, and you've got 2 mirrors on
that array).

You could try setting the max_parallel_workers_per_gather to 2, which
should be the optimum allocation for your I/O layout.
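
One way to apply that (a sketch; this GUC only needs a reload, not a restart):

psql -d flows <<'SQL'
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
SELECT pg_reload_conf();   -- new value is picked up by subsequent statements/sessions
SQL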

You might also want to test switching to the deadline scheduler. While
the controller may get more aggregate throughput rearranging your I/O
requests, high I/O latency will severely reduce postgres' ability to
saturate the I/O system itself, and deadline tends to minimize
latency. I've had good results in the past using deadline, but take
this suggestion with a grain of salt, YMMV.
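
Switching the scheduler can be tried at runtime along these lines (sdc is the data array described earlier in the thread; persisting the change needs a udev rule or a kernel boot parameter):

cat /sys/block/sdc/queue/scheduler                        # current scheduler is shown in brackets
echo deadline | sudo tee /sys/block/sdc/queue/scheduler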


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Claudio,

Find attached the iostat measured while redoing the query above (iostat1.txt). sda holds my temp directory (noop i/o scheduler), sdb holds only the swap partition (cfq i/o scheduler), and sdc (5 disks RAID0, noop i/o scheduler) holds the data. I didn't pay attention to the load caused by 12 parallel scans as I thought the RAID card would be smart enough to re-arrange the read requests optimally regardless of the load. At one moment during the query, there is a write storm to the swap drive (a bit like this case: https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com). I can hardly explain it as there is plenty of memory on this server. The execution time of the query was 4801.1s (about 1h20min).
I reduced max_parallel_workers_per_gather to 2 and max_parallel_workers to 3, restarted postgresql, then ran the query again while running iostat (iostat2.txt): the query ran much faster, 1992.8s (about 33min) instead of 4801.1s (about 1h20min), and the swap storm is gone! You were right about max_parallel_workers_per_gather!!
For the last test, I changed the scheduler on sdc to deadline (iostat3.txt), keeping max_parallel_workers_per_gather=2 and max_parallel_workers=3, then restarted postgresql. The execution time is almost the same: 1938.7s vs 1992.8s for the noop scheduler.

Thanks a lot for the suggestion, I'll keep my number of workers low to make sure I maximize my array usage.

Charles




--
Attachment

Re: [PERFORM] Very poor read performance, query independent

From
Claudio Freire
Date:
On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Claudio,
>
> At one moment
> during the query, there is a write storm to the swap drive (a bit like this
> case:
> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> I can hardly explain it as there is plenty of memory on this server.

That sounds a lot like NUMA zone_reclaim issues:

https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com
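
A quick way to check the current setting and the per-node memory situation (assuming numactl is installed):

cat /proc/sys/vm/zone_reclaim_mode   # 0 = zone reclaim disabled
numactl --hardware                   # free memory per NUMA node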


Re: [PERFORM] Very poor read performance, query independent

From
Claudio Freire
Date:
On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> <charles.nadeau@gmail.com> wrote:
>> Claudio,
>>
>> At one moment
>> during the query, there is a write storm to the swap drive (a bit like this
>> case:
>> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
>> I can hardly explain it as there is plenty of memory on this server.
>
> That sounds a lot like NUMA zone_reclaim issues:
>
> https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com

I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
eerily similar.


Re: [PERFORM] Very poor read performance, query independent

From
Justin Pryzby
Date:
On Tue, Jul 18, 2017 at 02:13:58PM -0300, Claudio Freire wrote:
> On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> > <charles.nadeau@gmail.com> wrote:
> >> Claudio,
> >>
> >> At one moment
> >> during the query, there is a write storm to the swap drive (a bit like this
> >> case:
> >> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> >> I can hardly explain it as there is plenty of memory on this server.
> >
> > That sounds a lot like NUMA zone_reclaim issues:
> >
> > https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com
>
> I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
> eerily similar.

Did you look at disabling KSM and/or THP ?

sudo sh -c 'echo 2 >/sys/kernel/mm/ksm/run'
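
The THP counterpart, in the same style (the usual sysfs paths; persist it via rc.local or a kernel boot parameter):

sudo sh -c 'echo never >/sys/kernel/mm/transparent_hugepage/enabled'
sudo sh -c 'echo never >/sys/kernel/mm/transparent_hugepage/defrag'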

https://www.postgresql.org/message-id/20170524155855.GH31097%40telsasoft.com
https://www.postgresql.org/message-id/CANQNgOrD02f8mR3Y8Pi=zFsoL14RqNQA8hwz1r4rSnDLr1b2Cw@mail.gmail.com
https://www.postgresql.org/message-id/CAHyXU0y9hviyKWvQZxX5UWfH9M2LYvwvAOPQ_DUPva2b71t12g%40mail.gmail.com
https://www.postgresql.org/message-id/20130716195834.8fe5c79249cb2ff0d4270b3e@yahoo.es
https://www.postgresql.org/message-id/CAE_gQfW3dBiELcOppYN6v%3D8%2B%2BpEeywD7iXGw-OT3doB8SXO4_A%40mail.gmail.com

https://www.postgresql.org/message-id/flat/1436268563235-5856914.post%40n5.nabble.com#1436268563235-5856914.post@n5.nabble.com
https://www.postgresql.org/message-id/CAL_0b1tJOZCx3Lo3Eve1RqGaT%2BJJ_Q7w4pkJ87WfWwXbTugnxw@mail.gmail.com
https://www.postgresql.org/message-id/556E2068.7070007@vuole.me
https://www.postgresql.org/message-id/1415981309.90631.YahooMailNeo%40web133205.mail.ir2.yahoo.com
https://www.postgresql.org/message-id/CAHyXU0yXYpCXN4%3D81ZDRQu-oGzrcq2qNAXDpyz4oiQPPAGk4ew%40mail.gmail.com
https://www.pythian.com/blog/performance-tuning-hugepages-in-linux/
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

Justin


Re: [PERFORM] Very poor read performance, query independent

From
Scott Marlowe
Date:
On Tue, Jul 18, 2017 at 3:20 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Claudio,
>
> Find attached the iostat measured while redoing the query above
> (iostat1.txt). sda holds my temp directory (noop i/o scheduler), sdb the
> swap partition (cfq i/o scheduler) only and sdc (5 disks RAID0, noop i/o
> scheduler) holds the data. I didn't pay attention to the load caused by 12
> parallel scans as I thought the RAID card would be smart enough to
> re-arrange the read requests optimally regardless of the load. At one moment
> during the query, there is a write storm to the swap drive (a bit like this
> case:
> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).

My experience from that case (and a few more) has led me to believe
that Linux database servers with plenty of memory should have their
swaps turned off. The Linux kernel works hard to swap out little used
memory to make more space for caching active data.

Problem is that whatever decides to swap stuff out gets stupid when
presented with 512GB RAM and starts swapping out things like sys v
shared_buffers etc.

Here's the thing, either your memory is big enough to buffer your
whole data set, so nothing should get swapped out to make room for
caching.

OR your dataset is much bigger than memory. In which case, making more
room gets very little if it comes at the cost of waiting for stuff you
need to get read back in.

Linux servers should also have zone reclaim turned off, and THP disabled.

Try running "sudo swapoff -a" and see if it gets rid of your swap storms.


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Justin,

Thanks for the extensive reading list, very educative.

After reading https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ I was thinking that it could be a NUMA/THP-related problem.
Turning off THP solved the "swap storm" problem. Some queries are even 40% faster with THP off.
Then turning off KSM as well improved performance by another 5%.
I was seriously worried about this issue as we received today another server with 144GB of RAM.

I will try to post a little summary of all the suggestions I received via this thread later this week/early next week.

Thanks!

Charles




--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

I received yesterday a second server having 16 drive bays. Just for a quick trial, I used 2 old 60GB SSDs (a Kingston V300 and an ADATA SP900) to build a RAID0. To my surprise, my very picky RAID controller (HP P410i) recognised them without a fuss (although as SATA II drives at 3Gb/s). A quick fio benchmark gives me 22000 random 4k read IOPS, more than my 5 146GB 10k SAS disks in RAID0. I moved my most frequently used index to this array and will try to do some benchmarks.
Knowing that SSDs based on the SandForce-2281 controller are recognised by my server, I may buy a pair of bigger/newer ones to put my tables on.
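
The fio run was roughly of this shape (the exact job parameters aren't in the thread, so the values below are assumptions, and the device name is hypothetical):

fio --name=randread --filename=/dev/sdd --readonly --direct=1 \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --runtime=60 --time_based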

Thanks!

Charles

On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Thinking about this a bit more - if somewhat more blazing performance is needed, then this could be achieved via losing the RAID card and spinning disks altogether and buying 1 of the NVME or SATA solid state products: e.g

- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and 200K IOPS)

- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)


The Samsung needs an M.2 port on the mobo (but most should have 'em - and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit more expensive compared to the Samsung, and is slower but has a longer lifetime. However for your workload the Sammy is probably fine.

regards

Mark


On 15/07/17 11:09, Mark Kirkwood wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0).



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
All,

Here is a list of what I did based on the suggestions made after my initial post (a sketch of the corresponding commands follows the list):
*Reduce max_parallel_workers to 4: Higher values make the workers wait for data as the RAID0 array can't deliver high enough IOPS.
*Reduce random_page_cost to 1: Forcing the use of indexes makes queries faster despite the low random throughput.
*Increase shared_buffers to 66GB and effective_cache_size to 53GB: With the new server having 144GB of RAM, increasing shared_buffers allows Postgresql to keep a lot of data in memory, reducing the need to go to disk.
*Reduce min_parallel_relation_size to 512kB to have more workers when doing parallel sequential scans.
*Increase the /sys/block/sd[ac]/queue/read_ahead_kb to 16384 for my arrays using HDDs.
*Reuse old SSDs (that are compatible with my RAID controller, to my surprise) to hold my most used indexes and tables.
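
A rough sketch of the changes above (assuming the parallel-worker item refers to max_parallel_workers_per_gather on 9.6; shared_buffers needs a restart to take effect):

psql -d flows <<'SQL'
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET random_page_cost = 1;
ALTER SYSTEM SET shared_buffers = '66GB';              -- requires a server restart
ALTER SYSTEM SET effective_cache_size = '53GB';
ALTER SYSTEM SET min_parallel_relation_size = '512kB';
SQL
sudo sh -c 'echo 16384 >/sys/block/sda/queue/read_ahead_kb'
sudo sh -c 'echo 16384 >/sys/block/sdc/queue/read_ahead_kb'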

Thanks to everybody who made suggestions. I now know more about Postgresql tuning.

Charles



--

Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Nice!

Pleased that the general idea worked well for you!

I'm also relieved that you did not follow my recommendation exactly -
I've been trialling a Samsung 960 Evo (256GB) and Intel 600p (256GB) and
I've stumbled across the serious disadvantages of (consumer) M.2 drives
using TLC NAND - terrible sustained write performance! While these guys
can happily do ~ 2GB/s reads, their write performance is only 'burst
capable'. They have small SLC NAND 'write caches' that do ~1GB/s for a
*limited time* (10-20s) and after that you get ~ 200 MB/s! Ouch - my old
Crucial 550 can do 350 MB/s sustained writes (so two of them in RAID0
are doing 700 MB/s for hours).

Bigger capacity drives can do better - but overall I'm not that
impressed with the current trend of using TLC NAND.

regards

Mark

