Thread: [PERFORM] Very poor read performance, query independent

[PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:

I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic). Hardware is:

*2x Intel Xeon E5550

*72GB RAM

*Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80% read/20% write) for Postgresql data only:

Logical Drive: 3

Size: 273.4 GB

Fault Tolerance: 1+0

Heads: 255

Sectors Per Track: 32

Cylinders: 65535

Strip Size: 128 KB

Full Stripe Size: 256 KB

Status: OK

Caching: Enabled

Unique Identifier: 600508B1001037383941424344450A00

Disk Name: /dev/sdc

Mount Points: /mnt/data 273.4 GB

OS Status: LOCKED

Logical Drive Label: A00A194750123456789ABCDE516F

Mirror Group 0:

physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 146 GB, OK)

physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 146 GB, OK)

Mirror Group 1:

physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS, 146 GB, OK)

physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SAS, 146 GB, OK)

Drive Type: Data

Formatted with ext4 with: sudo mkfs.ext4 -E stride=32,stripe_width=64 -v /dev/sdc1.

Mounted in /etc/fstab with this line: "UUID=99fef4ae-51dc-4365-9210-0b153b1cbbd0 /mnt/data ext4 rw,nodiratime,user_xattr,noatime,nobarrier,errors=remount-ro 0 1"
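
As a sanity check, the stride/stripe_width values follow from the array geometry above (128 KB strips, two data-bearing disks in a four-disk RAID10, 4 kB ext4 blocks). A minimal sketch of the arithmetic:

# Figures taken from the controller output above; 4 kB is the ext4 block size.
BLOCK_KB=4
STRIP_KB=128
DATA_DISKS=2
STRIDE=$((STRIP_KB / BLOCK_KB))        # 128 / 4 = 32
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))  # 32 * 2 = 64
echo "stride=${STRIDE} stripe_width=${STRIPE_WIDTH}"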

Postgresql is the only application running on this server.


Postgresql is used as a mini data warehouse to generate reports and do statistical analysis. It is used by at most 2 users and fresh data is added every 10 days. The database has 16 tables: one is 224GB and the rest are between 16kB and 470MB.


My configuration is:


name | current_setting | source

---------------------------------+------------------------------------------------+----------------------

application_name | psql | client

autovacuum_vacuum_scale_factor | 0 | configuration file

autovacuum_vacuum_threshold | 2000 | configuration file

checkpoint_completion_target | 0.9 | configuration file

checkpoint_timeout | 30min | configuration file

client_encoding | UTF8 | client

client_min_messages | log | configuration file

cluster_name | 9.6/main | configuration file

cpu_index_tuple_cost | 0.001 | configuration file

cpu_operator_cost | 0.0005 | configuration file

cpu_tuple_cost | 0.003 | configuration file

DateStyle | ISO, YMD | configuration file

default_statistics_target | 100 | configuration file

default_text_search_config | pg_catalog.english | configuration file

dynamic_shared_memory_type | posix | configuration file

effective_cache_size | 22GB | configuration file

effective_io_concurrency | 4 | configuration file

external_pid_file | /var/run/postgresql/9.6-main.pid | configuration file

lc_messages | C | configuration file

lc_monetary | en_CA.UTF-8 | configuration file

lc_numeric | en_CA.UTF-8 | configuration file

lc_time | en_CA.UTF-8 | configuration file

listen_addresses | * | configuration file

lock_timeout | 100s | configuration file

log_autovacuum_min_duration | 0 | configuration file

log_checkpoints | on | configuration file

log_connections | on | configuration file

log_destination | csvlog | configuration file

log_directory | /mnt/bigzilla/data/toburn/hp/postgresql/pg_log | configuration file

log_disconnections | on | configuration file

log_error_verbosity | default | configuration file

log_file_mode | 0600 | configuration file

log_filename | postgresql-%Y-%m-%d_%H%M%S.log | configuration file

log_line_prefix | user=%u,db=%d,app=%aclient=%h | configuration file

log_lock_waits | on | configuration file

log_min_duration_statement | 0 | configuration file

log_min_error_statement | debug1 | configuration file

log_min_messages | debug1 | configuration file

log_rotation_size | 1GB | configuration file

log_temp_files | 0 | configuration file

log_timezone | localtime | configuration file

logging_collector | on | configuration file

maintenance_work_mem | 3GB | configuration file

max_connections | 10 | configuration file

max_locks_per_transaction | 256 | configuration file

max_parallel_workers_per_gather | 14 | configuration file

max_stack_depth | 2MB | environment variable

max_wal_size | 4GB | configuration file

max_worker_processes | 14 | configuration file

min_wal_size | 2GB | configuration file

parallel_setup_cost | 1000 | configuration file

parallel_tuple_cost | 0.012 | configuration file

port | 5432 | configuration file

random_page_cost | 22 | configuration file

seq_page_cost | 1 | configuration file

shared_buffers | 34GB | configuration file

shared_preload_libraries | pg_stat_statements | configuration file

ssl | on | configuration file

ssl_cert_file | /etc/ssl/certs/ssl-cert-snakeoil.pem | configuration file

ssl_key_file | /etc/ssl/private/ssl-cert-snakeoil.key | configuration file

statement_timeout | 1000000s | configuration file

stats_temp_directory | /var/run/postgresql/9.6-main.pg_stat_tmp | configuration file

superuser_reserved_connections | 1 | configuration file

syslog_facility | local1 | configuration file

syslog_ident | postgres | configuration file

syslog_sequence_numbers | on | configuration file

temp_file_limit | 80GB | configuration file

TimeZone | localtime | configuration file

track_activities | on | configuration file

track_counts | on | configuration file

track_functions | all | configuration file

unix_socket_directories | /var/run/postgresql | configuration file

vacuum_cost_delay | 1ms | configuration file

vacuum_cost_limit | 5000 | configuration file

vacuum_cost_page_dirty | 200 | configuration file

vacuum_cost_page_hit | 10 | configuration file

vacuum_cost_page_miss | 100 | configuration file

wal_buffers | 16MB | configuration file

wal_compression | on | configuration file

wal_sync_method | fdatasync | configuration file

work_mem | 1468006kB | configuration file


The part of /etc/sysctl.conf I modified is:

vm.swappiness = 1

vm.dirty_background_bytes = 134217728

vm.dirty_bytes = 1073741824

vm.overcommit_ratio = 100

vm.zone_reclaim_mode = 0

kernel.numa_balancing = 0

kernel.sched_autogroup_enabled = 0

kernel.sched_migration_cost_ns = 5000000


The problem I have is very poor read performance. When I benchmark my array with fio I get random reads of about 200MB/s and 1100 IOPS and sequential reads of about 286MB/s and 21000 IOPS. But when I watch my queries using pg_activity, I get at best 4MB/s. Also, using dstat I can see that iowait time is at about 25%. This problem is not query-dependent.

I backed up the database, reformatted the array making sure it is well aligned, then restored the database and got the same result.

Where should I target my troubleshooting at this stage? I reformatted my drive, and I tuned my postgresql.conf and OS as much as I could. The hardware doesn’t seem to have any issues; I am really puzzled.

Thanks!


Charles


--
Charles Nadeau Ph.D.

Re: [PERFORM] Very poor read performance, query independent

From
Rick Otten
Date:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).
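
Something along these lines, for example (a sketch only; the huge page count is sized to roughly cover the 34GB shared_buffers with 2MB pages plus headroom, and 256 for effective_io_concurrency is just a value to try):

# Reserve ~35GB of 2MB huge pages (illustrative figure; check the postmaster's VmPeak as per the docs).
sudo sysctl -w vm.nr_hugepages=17900
echo 'vm.nr_hugepages = 17900' | sudo tee -a /etc/sysctl.conf

# In postgresql.conf:
#   huge_pages = try                 # needs a restart
#   effective_io_concurrency = 256   # a reload is enough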



Re: [PERFORM] Very poor read performance, query independent

From
Andreas Kretschmer
Date:

Am 10.07.2017 um 16:03 schrieb Charles Nadeau:
> random_page_cost | 22


why such a high value for random_page_cost?

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Andreas,

Because the ratio between the Sequential IOPS and Random IOPS is about 29. Taking into account that part of the data is in RAM, I obtained an "effective" ratio of about 22.
Thanks!

Charles

On Mon, Jul 10, 2017 at 5:35 PM, Andreas Kretschmer <andreas@a-kretschmer.de> wrote:


Am 10.07.2017 um 16:03 schrieb Charles Nadeau:
random_page_cost | 22


why such a high value for random_page_cost?

Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com




Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


Considering your 72 GB of RAM and a database size of ~225GB, and also the fact that Postgres is the only app running on the server, probably 1/3 of your database resides in memory, so random_page_cost = 22 looks extremely high; it probably precludes index usage in your queries altogether.

 

You should try this setting at least at its default value, random_page_cost = 4, and probably go even lower.

Also, effective_cache_size should be at least as big as your shared_buffers. With 72GB of RAM, effective_cache_size should be set around 64GB (again considering that Postgres is the only app running on the server).
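
For example (a sketch; it assumes you can connect as a superuser, and both settings take effect on a reload, no restart needed):

psql -c "ALTER SYSTEM SET random_page_cost = 4;"
psql -c "ALTER SYSTEM SET effective_cache_size = '64GB';"
psql -c "SELECT pg_reload_conf();"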

 

Regards,

Igor Neyman

 

 

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Mon, Jul 10, 2017 at 7:03 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:


The problem I have is very poor read. When I benchmark my array with fio I get random reads of about 200MB/s and 1100IOPS and sequential reads of about 286MB/s and 21000IPS.



That doesn't seem right.  Sequential is only 43% faster?  What job file are you giving to fio?

What do you get if you do something simpler, like: 

time cat ~/$PGDATA/base/16402/*|wc -c

replacing 16402 with whatever your biggest database is.
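
If you are not sure which OID that is, something like this (just a sketch) lists the databases by size:

psql -c "SELECT oid, datname, pg_size_pretty(pg_database_size(oid)) FROM pg_database ORDER BY pg_database_size(oid) DESC;"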

Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Rick,

I applied the change you recommended but it didn't speed up the reads.
One thing I forgot to mention earlier is that the speed of the backup made with COPY operations seems almost normal: I see read speeds of up to 85MB/s.
Thanks for your help!

Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

I reduced the value of random_page_cost to 4 but the read speed remains low.
Regarding effective_cache_size and shared_buffers, do you mean they should both be set to 64GB?
Thanks for suggestions!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:
charles@hpdl380g6:~$ sudo sh -c 'time cat /mnt/data/postgresql/base/16385/* | wc -c'
[sudo] password for charles: 
1.62user 179.94system 29:50.79elapsed 10%CPU (0avgtext+0avgdata 1920maxresident)k
448026264inputs+0outputs (0major+117minor)pagefaults 0swaps
241297594904

After making the huge pages changes suggested by Rick Otten (above), I got slightly better results (135.7MB/s):
charles@hpdl380g6:~$ sudo sh -c 'time cat /mnt/data/postgresql/base/16385/* | wc -c'
[sudo] password for charles: 
0.86user 130.84system 28:15.78elapsed 7%CPU (0avgtext+0avgdata 1820maxresident)k
471286792inputs+0outputs (1major+118minor)pagefaults 0swaps
241297594904

Could you suggest another way to benchmark random reads?

Thanks for your help!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
After reducing random_page_cost to 4 and testing more, I can report that the aggregate read throughput for a parallel sequential scan is about 90MB/s. However, the throughput for a non-parallel sequential scan is still around 4MB/s.

One more question: if a query uses more than one table, can more than one table be read through a parallel sequential scan? I have many queries joining  tables and I noticed that there was never more than one table read in parallel.
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


 

No, they should not be equal.

From the docs:

 

effective_cache_size (integer)

Sets the planner's assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used. When setting this parameter you should consider both PostgreSQL's shared buffers and the portion of the kernel's disk cache that will be used for PostgreSQL data files. Also, take into account the expected number of concurrent queries on different tables, since they will have to share the available space. This parameter has no effect on the size of shared memory allocated by PostgreSQL, nor does it reserve kernel disk cache; it is used only for estimation purposes. The system also does not assume data remains in the disk cache between queries. The default is 4 gigabytes (4GB).

So, I’d set shared_buffers at 24GB and effective_cache_size at 64GB.
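
For example (a sketch; shared_buffers, unlike effective_cache_size, only takes effect after a full restart, and the service name below assumes the stock Ubuntu packaging of your 9.6/main cluster):

psql -c "ALTER SYSTEM SET shared_buffers = '24GB';"
psql -c "ALTER SYSTEM SET effective_cache_size = '64GB';"
sudo systemctl restart postgresql@9.6-main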

 

Regards,

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


 

Also, maybe it’s time to look at execution plans (explain analyze) of specific slow queries, instead of trying to solve the problem “in general”.

 

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

The sum of effective_cache_size and shared_buffers will be higher than the physical memory I have. Is that OK?
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


 

Yes, that’s normal.

 

shared_buffers is the maximum amount of memory that Postgres is allowed to allocate for its buffer cache, while effective_cache_size is just a number the optimizer takes into account when creating an execution plan.

 

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Merlin Moncure
Date:
On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic).
> Hardware is:
>
> *2x Intel Xeon E5550
>
> *72GB RAM
>
> *Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80%
> read/20% write) for Postgresql data only:
>
> The problem I have is very poor read. When I benchmark my array with fio I
> get random reads of about 200MB/s and 1100IOPS and sequential reads of about
> 286MB/s and 21000IPS. But when I watch my queries using pg_activity, I get
> at best 4MB/s. Also using dstat I can see that iowait time is at about 25%.
> This problem is not query-dependent.

Stop right there.     1100 iops * 8kb = ~8mb/sec raw which might
reasonably translate to 4mb/sec to the client. 200mb/sec random
read/sec on spinning media is simply not plausible; your benchmark is
lying to you.   Random reads on spinning media are absolutely going to
be storage bound.

merlin


Re: [PERFORM] Very poor read performance, query independent

From
"Joshua D. Drake"
Date:
On 07/11/2017 04:15 PM, Merlin Moncure wrote:
> On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau
> <charles.nadeau@gmail.com> wrote:
>> I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic).
>> Hardware is:
>>
>> *2x Intel Xeon E5550
>>
>> *72GB RAM
>>
>> *Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80%
>> read/20% write) for Postgresql data only:
>>
>> The problem I have is very poor read. When I benchmark my array with fio I
>> get random reads of about 200MB/s and 1100IOPS and sequential reads of about
>> 286MB/s and 21000IPS. But when I watch my queries using pg_activity, I get
>> at best 4MB/s. Also using dstat I can see that iowait time is at about 25%.
>> This problem is not query-dependent.
>
> Stop right there.     1100 iops * 8kb = ~8mb/sec raw which might
> reasonably translate to 4mb/sec to the client. 200mb/sec random
> read/sec on spinning media is simply not plausible;

Sure it is, if he had more than 4 disks ;) but he also isn't going to
get 1100 IOPS from 4 10k disks. The average 10k disk is going to get
around 130 IOPS. If he only has 4, then there is no way he is getting
1100 IOPS.

Using the above specs (4x146GB), the best he can reasonably hope for from
the drives themselves is about 50MB/s. Add in the 1GB FBWC and that is
how he is getting those high IOPS numbers, but that is because of
caching.

He may need to adjust his readahead as well as his kernel scheduler. At
a minimum he should be able to saturate the drives without issue.

JD



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****


Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

I don't think any of those are directly relevant to PostgreSQL, as it doesn't use direct IO, doesn't use libaio, and is rarely going to get anywhere near 256 iodepth.  So the best they can do is put a theoretical ceiling on the performance.  Also, random IO with a 4MB stride doesn't make any sense from a PostgreSQL perspective.
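
Something closer to what a single backend does would be buffered, synchronous 8kB random reads, for example (a sketch, not a tuned job file):

fio --name=pglike --ioengine=sync --direct=0 --bs=8k --iodepth=1 --size=10G --readwrite=randread --filename=test --runtime=60 --time_based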

 

Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:

I would have expected it to do better than that.  Maybe you should increase the kernel readahead setting.  I've found the default to be much too small.  But it doesn't make much difference to you, as you appear to be doing random IO in your queries, not sequential.
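
For example (a sketch; 4096 sectors = 2MB is only an illustrative value):

sudo blockdev --getra /dev/sdc        # current readahead, in 512-byte sectors
sudo blockdev --setra 4096 /dev/sdc   # raise it to 2MB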


Could you suggest another way to benchmark random reads?

Your 1100 IOPS times 8kB block size gives about 8MB/s of throughput, which is close to what you report.  So I would instead focus on tuning your actual queries.  You say the problem is not query-dependent, but I think that just means all the queries you looked at are similar.  If you looked at a query that can't use indexes, like count(unindexed_column) from biggest_table, you would find it doing much more IO than 4MB/s.

Can you pick the simplest query you actually care about, and post both an "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it?  (Preferably turning "track_io_timing" on first).

One other question I had: you said you had "2x Intel Xeon E5550", which should be 8 CPUs (or 16, if the hyperthreads are reported as separate CPUs).  But you also said: "Also using dstat I can see that iowait time is at about 25%".  Usually if there is only one thing going on on the server, iowait won't be more than the reciprocal of the number of CPUs.  Is the server busy doing other stuff at the same time you are benchmarking it?

Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Tue, Jul 11, 2017 at 4:42 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
On 07/11/2017 04:15 PM, Merlin Moncure wrote:
On Mon, Jul 10, 2017 at 9:03 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
I’m running PostgreSQL 9.6.3 on Ubuntu 16.10 (kernel 4.4.0-85-generic).
Hardware is:

*2x Intel Xeon E5550

*72GB RAM

*Hardware RAID10 (4 x 146GB SAS 10k) P410i controller with 1GB FBWC (80%
read/20% write) for Postgresql data only:

The problem I have is very poor read. When I benchmark my array with fio I
get random reads of about 200MB/s and 1100IOPS and sequential reads of about
286MB/s and 21000IPS. But when I watch my queries using pg_activity, I get
at best 4MB/s. Also using dstat I can see that iowait time is at about 25%.
This problem is not query-dependent.

Stop right there.     1100 iops * 8kb = ~8mb/sec raw which might
reasonably translate to 4mb/sec to the client. 200mb/sec random
read/sec on spinning media is simply not plausible;

Sure it is, if he had more than 4 disks ;)

Or more to the point here, if each random read is 4MB long, which makes it more like sequential reads done piecewise at random offsets rather than true random reads.
 
but he also isn't going to get 1100 IOPS from 4 10k disks. The average 10k disk is going to get around 130 IOPS . If he only has 4 then there is no way he is getting 1100 IOPS.

I wouldn't be sure.  He is using an iodepth of 256 in his benchmark.  It wouldn't be all that outrageous for a disk to be able to find 3 or 4 sectors per revolution it can read, when it has that many to choose from.

 Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Hmm - how are you measuring that sequential scan speed of 4MB/s? I'd
recommend doing a very simple test, e.g. here's one on my workstation -
13 GB single table on 1 SATA drive - cold cache after reboot, sequential
scan using Postgres 9.6.2:

bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                      QUERY PLAN
------------------------------------------------------------------------------------
  Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
    ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00
rows=100000000 width=0)
(2 rows)


bench=#  SELECT pg_relation_size('pgbench_accounts');
  pg_relation_size
------------------
       13429514240
(1 row)

bench=# SELECT count(*) FROM pgbench_accounts;
    count
-----------
  100000000
(1 row)

Time: 118884.277 ms


So doing the math, the seq read speed is about 110MB/s (i.e. 13 GB in 120
sec). Sure enough, while I was running the query iostat showed:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  926.00    0.00   114.89     0.00   254.10     1.90    2.03    2.03    0.00   1.08 100.00


So it might be useful for us to see something like that from your system -
note you need to check you have really flushed the cache, and that no
other apps are using the db.
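
One way to make sure the cache really is cold before such a test (a sketch; it drops the whole OS page cache, and you would also restart Postgres if you want shared_buffers cold):

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches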

regards

Mark

On 12/07/17 00:46, Charles Nadeau wrote:
> After reducing random_page_cost to 4 and testing more, I can report
> that the aggregate read throughput for parallel sequential scan is
> about 90MB/s. However the throughput for sequential scan is still
> around 4MB/s.
>



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

I set shared_buffers to 24GB and effective_cache_size to 64GB, and I can see that the queries are faster because the indexes are used more often. Knowing I have 72GB of RAM and that the server is exclusively dedicated to Postgresql, what could be the maximum value for effective_cache_size?
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Joshua,

I use noop as the scheduler because it is better to let the RAID controller re-arrange the IO operations before they reach the disks. Read-ahead is set to 128:
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/read_ahead_kb
128
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 
Thanks!

Charles


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Jeff,

Here are the two EXPLAINs for one of my simplest queries:

flows=# SET track_io_timing = on;
LOG:  duration: 24.101 ms  statement: SET track_io_timing = on;
SET
flows=# explain (analyze, timing off) SELECT DISTINCT
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport,
flows-#    COUNT(*) AS conversation,
flows-#    SUM(doctets) / 1024 / 1024 AS mbytes 
flows-# FROM
flows-#    flowscompact,
flows-#    mynetworks 
flows-# WHERE
flows-#    mynetworks.ipaddr >>= flowscompact.srcaddr 
flows-#    AND dstaddr IN 
flows-#    (
flows(#       SELECT
flows(#          dstaddr 
flows(#       FROM
flows(#          dstexterne
flows(#    )
flows-# GROUP BY
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport 
flows-# ORDER BY
flows-#    mbytes DESC LIMIT 50;
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.3", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.4", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.5", size 639696896
LOG:  duration: 2632108.352 ms  statement: explain (analyze, timing off) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes 
FROM
   flowscompact,
   mynetworks 
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr 
   AND dstaddr IN 
   (
      SELECT
         dstaddr 
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport 
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                                  QUERY PLAN                                                                                                   
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual rows=50 loops=1)
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual rows=50 loops=1)
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual rows=45499045 loops=10)
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual rows=54155970 loops=10)
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 48679396
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual rows=2 loops=5415597040)
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual rows=4 loops=10)
 Planning time: 62.066 ms
 Execution time: 2631923.716 ms
(33 rows)

flows=# explain (analyze, buffers) SELECT DISTINCT
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport,
flows-#    COUNT(*) AS conversation,
flows-#    SUM(doctets) / 1024 / 1024 AS mbytes 
flows-# FROM
flows-#    flowscompact,
flows-#    mynetworks 
flows-# WHERE
flows-#    mynetworks.ipaddr >>= flowscompact.srcaddr 
flows-#    AND dstaddr IN 
flows-#    (
flows(#       SELECT
flows(#          dstaddr 
flows(#       FROM
flows(#          dstexterne
flows(#    )
flows-# GROUP BY
flows-#    srcaddr,
flows-#    dstaddr,
flows-#    dstport 
flows-# ORDER BY
flows-#    mbytes DESC LIMIT 50;
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824
LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896
LOG:  duration: 2765020.327 ms  statement: explain (analyze, buffers) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes 
FROM
   flowscompact,
   mynetworks 
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr 
   AND dstaddr IN 
   (
      SELECT
         dstaddr 
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport 
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                                              QUERY PLAN                                                                                                              
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)
   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
   I/O Timings: read=5323746.860
   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)
         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
         I/O Timings: read=5323746.860
         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 563150kB
               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
               I/O Timings: read=5323746.860
               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                     I/O Timings: read=5323746.860
                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 2721856kB
                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244
                           I/O Timings: read=5323746.860
                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)
                                 Workers Planned: 9
                                 Workers Launched: 9
                                 Buffers: shared hit=1116590559 read=15851133
                                 I/O Timings: read=5323746.860
                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=1116588309 read=15851133
                                       I/O Timings: read=5323746.860
                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)
                                             Buffers: shared hit=996551813 read=15851133
                                             I/O Timings: read=5323746.860
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)
                                                   Buffers: shared hit=1634 read=15851133
                                                   I/O Timings: read=5323746.860
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 59971474
                                                   Buffers: shared hit=996550152
                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB
                                             Buffers: shared hit=120036496
                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=120036496
                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 453681377
                                                         Buffers: shared hit=120036496
                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)
                                                               Heap Fetches: 91
                                                               Buffers: shared hit=120036459
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)
                                                               Buffers: shared hit=10
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)
                                                                     Buffers: shared hit=10
 Planning time: 6.689 ms
 Execution time: 2764860.853 ms
(58 rows)

Regarding "Also using dstat I can see that iowait time is at about 25%", I don't think the server was doing anything else. If it is important, I can repeat the benchmarks.
Thanks!

Charles

On Wed, Jul 12, 2017 at 2:39 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Jul 11, 2017 at 4:02 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Jeff,

I used fio in a quick benchmarking script inspired by https://smcleod.net/benchmarking-io/:
#!/bin/bash
#Random throughput
echo "Random throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=randread --ramp_time=4
#Random IOPS
echo "Random IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=randread --ramp_time=4
#Sequential throughput
echo "Sequential throughput"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4
#Sequential IOPS
echo "Sequential IOPS"
sync
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=4G --readwrite=read --ramp_time=4

I don't think any of those are directly relevant to PostgreSQL, as it doesn't use direct IO, doesn't use libaio, and is rarely going to get anywhere near 256 iodepth.  So the best they can do is put a theoretical ceiling on the performance.  Also, random IO with a 4MB stride doesn't make any sense from a PostgreSQL perspective.
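For comparison, a rough sketch of a fio run that is closer to PostgreSQL's access pattern (buffered, synchronous 8kB random reads with a single outstanding request; the file name, size, and runtime here are arbitrary) would be:

# Buffered 8kB random reads, one request at a time, roughly what a single
# PostgreSQL backend issues when it does random I/O.
fio --name=pglike --filename=test --size=4G --ioengine=psync --direct=0 --bs=8k --iodepth=1 --readwrite=randread --runtime=60 --time_based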

 

Performing the test you suggested, I get 128.5MB/s. Monitoring the test, I find that the throughput is constant from start to finish and that the iowait is also constant at 5%:

I would have expected it to do better than that.  Maybe you should increase the kernel readahead setting; I've found the default to be much too small.  But it doesn't make much difference to you, as you appear to be doing random IO in your queries, not sequential.
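For example (a sketch; /dev/sdc is the data volume from your description, and the 4MB value is just a starting point to experiment with):

# Current read-ahead, in 512-byte sectors
sudo blockdev --getra /dev/sdc
# Raise it to 8192 sectors (4MB), i.e. read_ahead_kb = 4096
sudo blockdev --setra 8192 /dev/sdc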


Could you suggest another way to benchmark random reads?

Your 1100 IOPS times 8kB block size gives about 8MB/s of throughput, which is close to what you report.  So I would instead focus on tuning your actual queries.  You say the problem is not query-dependent, but I think that just means all the queries you looked at are similar.  If you looked at a query that can't use indexes, like count(unindexed_column) from biggest_table, you would find it doing much more IO than 4MB/s.

Can you pick the simplest query you actually care about, and post both an "explain (analyze, timing off)" and an "explain (analyze, buffers)" for it?  (Preferably turning "track_io_timing" on first).
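Something like this, in one psql session (the count(*) on the big flows table is only an example; substitute the query you care about):

SET track_io_timing = on;
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM flows;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM flows;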

One other question I had: you said you had "2x Intel Xeon E5550", which should be 8 CPUs (or 16, if the hyperthreads are reported as separate CPUs).  But you also said: "Also using dstat I can see that iowait time is at about 25%".  Usually if there is only one thing going on on the server, then IOWAIT won't be more than the reciprocal of the number of CPUs.  Is the server busy doing other stuff at the same time you are benchmarking it?

Cheers,

Jeff



--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Rick,

Should the number of pages always be correlated to the VmPeak of the postmaster, or could it be set to reflect shared_buffers or another setting?
Thanks!

Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).






--

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Wednesday, July 12, 2017 3:21 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

Igor,

 

I set shared_buffers to 24GB and effective_cache_size to 64GB and I can see that the queries are faster because the indexes are used more often. Knowing I have 72GB of RAM and the server is exclusively dedicated to Postgresql, what could be the maximum value for effective_cache_size?

Thanks!

 

Charles

 

64GB for effective_cache_size should be good enough; adding a couple more GB wouldn’t change much.

 

Igor

Re: [PERFORM] Very poor read performance, query independent

From
Rick Otten
Date:


On Wed, Jul 12, 2017 at 9:38 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Rick,

Should the number of pages always be correlated to the VmPeak of the postmaster, or could it be set to reflect shared_buffers or another setting?
Thanks!


The documentation implies that you may need to adjust its size when you change the shared_buffers setting. 

I usually check it every now and then (I haven't built a formal monitor yet) to see how many of the huge pages are free/used.  If it looks like they are all getting consumed, consider bumping it higher; if there are lots free, you are probably fine.

cat /proc/meminfo | grep -i "^huge"
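A sketch of the sizing arithmetic the documentation describes, using the external_pid_file from your configuration (the final nr_hugepages value below is only an illustration, not a measurement from your server):

# PID of the postmaster
head -1 /var/run/postgresql/9.6-main.pid
# Peak virtual memory of that process, in kB
grep ^VmPeak /proc/$(head -1 /var/run/postgresql/9.6-main.pid)/status
# Huge page size, in kB (typically 2048)
grep ^Hugepagesize /proc/meminfo
# vm.nr_hugepages should be at least VmPeak / Hugepagesize, e.g.:
sudo sysctl -w vm.nr_hugepages=18000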

--

Also regarding my note on effective_io_concurrency, which I'm not sure you tried tweaking yet.

With file system and hardware caching between you and your spindles, your best setting for effective_io_concurrency may be much higher than the actual number of spindles.  It is worth experimenting with.  If you can, try several values.  You can use pgbench to put consistent workloads on your database for measurement purposes.
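A sketch of one iteration of that experiment (the value, the pgbench options, and the "bench" database initialized earlier with "pgbench -i" are all just examples):

-- in psql, try one candidate value (repeat with e.g. 4, 32, 64, 128, 256)
ALTER SYSTEM SET effective_io_concurrency = 64;
SELECT pg_reload_conf();

# then drive a repeatable, read-only workload and compare tps/latency
pgbench -S -c 8 -j 8 -T 300 bench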


Charles

On Mon, Jul 10, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
Although probably not the root cause, at the least I would set up hugepages  ( https://www.postgresql.org/docs/9.6/static/kernel-resources.html#LINUX-HUGE-PAGES ), and bump effective_io_concurrency up quite a bit as well (256 ?).

Re: [PERFORM] Very poor read performance, query independent

From
bricklen
Date:
On Wed, Jul 12, 2017 at 12:30 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:

I use noop as the scheduler because it is better to let the RAID controller re-arrange the IO operations before they reach the disks. Read-ahead is set to 128kB:
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/read_ahead_kb
128
charles@hpdl380g6:~$ cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq 


Perhaps pg_test_fsync (https://www.postgresql.org/docs/9.6/static/pgtestfsync.html) and pg_test_timing will help shed some light here, or at the very least give some numbers to compare against.
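For example (a sketch; the test file is put on the data array so the numbers reflect the P410i volume):

# fsync behaviour of the data volume, 5 seconds per test
pg_test_fsync -f /mnt/data/pg_test_fsync.out -s 5
# timer overhead, relevant if you turn on track_io_timing
pg_test_timing -d 3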

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 


Charles,

 

In your original posting I couldn’t find what value you set for temp_buffers.

Considering you have plenty of RAM, try setting temp_buffers=’6GB’ and then run ‘explain (analyze, buffers) select…’ in the same session. This should alleviate the “disk sort” problem.

 

Also, could you post the structure of the flowscompact, mynetworks, and dstexterne tables with all the indexes and number of rows? Actually, are they all tables, or are some of them views?

 

Igor

 

 

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 

 


 

 

Sorry, I misstated the parameter to change.

It is work_mem (not temp_buffers) you should try to increase to 6GB.

 

Igor

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Jeff Janes
Date:
On Wed, Jul 12, 2017 at 3:04 AM, Charles Nadeau <charles.nadeau@gmail.com> wrote:
Jeff,

Here are the 2 EXPLAINs for one of my simplest query:


It looks like dstexterne and flowscompact are both views over flows.  Can you share the definition of those views?

I think the iowait > 12.5% is due to the parallel query execution.  But then the question is, why is it only 25% when you have 10-fold parallelism?

It certainly looks like you are doing more than 4MB/s there, so maybe something is wrong with the instrumentation, or how you are interpreting it.  

Although it is still less than it perhaps could do.  To put a baseline on what you can expect out of parallel seq scans, can you do something like:

explain (analyze, buffers) select avg(doctets) from flows;

Cheers,

Jeff

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

First I must say that I changed my disk configuration from 4 disks in RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during the last ingest of data.
Here is the result test you asked. It was done with a cold cache:
flows=# \timing
Timing is on.
flows=# explain select count(*) from flows;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
   ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
         Workers Planned: 1
         ->  Partial Aggregate  (cost=17213914.07..17213914.07 rows=1 width=8)
               ->  Parallel Seq Scan on flows  (cost=0.00..17019464.49 rows=388899162 width=0)
(5 rows)

Time: 171.835 ms
flows=# select pg_relation_size('flows');
 pg_relation_size 
------------------
     129865867264
(1 row)

Time: 57.157 ms
flows=# select count(*) from flows;
LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
   count   
-----------
 589831190
(1 row)

Time: 625546.662 ms
The throughput reported by Postgresql is almost 198MB/s, and the throughput as measured by dstat during the query execution was between 25 and 299MB/s. It is much better than what I had before! The I/O wait was about 12% all through the query. One thing I noticed is the discrepancy between the read throughput reported by pg_activity and the one reported by dstat: pg_activity always reports a value lower than dstat.
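For reference, that figure is simply the relation size divided by the runtime; a quick sanity check in psql using the numbers above:

SELECT 129865867264 / 1024.0 / 1024 / 625.546 AS mib_per_sec;
-- ≈ 198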

Besides the change of disk configuration, here is what contributed the most to the improvement of the performance so far (summarized as a configuration fragment after this list):
Using huge pages
Increasing effective_io_concurrency to 256
Reducing random_page_cost from 22 to 4
Reducing min_parallel_relation_size to 512kB to have more workers when doing parallel sequential scans of my biggest table
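Collected as a postgresql.conf fragment (just a summary of the changes listed above for this workload, not a general recommendation; the huge pages change also needs vm.nr_hugepages on the kernel side):

huge_pages = try                     # kernel side configured via vm.nr_hugepages
effective_io_concurrency = 256
random_page_cost = 4
min_parallel_relation_size = 512kB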

Thanks for recommending this test, I now know what the real throughput should be!

Charles

On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Hmm - how are you measuring that sequential scan speed of 4MB/s? I'd recommend doing a very simple test, e.g., here's one on my workstation - 13 GB single table on 1 SATA drive - cold cache after reboot, sequential scan using Postgres 9.6.2:

bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
   ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00 rows=100000000 width=0)
(2 rows)


bench=#  SELECT pg_relation_size('pgbench_accounts');
 pg_relation_size
------------------
      13429514240
(1 row)

bench=# SELECT count(*) FROM pgbench_accounts;
   count
-----------
 100000000
(1 row)

Time: 118884.277 ms


So doing the math, seq read speed is about 110MB/s (i.e. 13 GB in 120 sec). Sure enough, while I was running the query iostat showed:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  926.00    0.00 114.89     0.00   254.10     1.90    2.03    2.03    0.00   1.08 100.00


So it might be useful for us to see something like that from your system - note you need to check you really have flushed the cache, and that no other apps are using the db.

regards

Mark


On 12/07/17 00:46, Charles Nadeau wrote:
After reducing random_page_cost to 4 and testing more, I can report that the aggregate read throughput for parallel sequential scan is about 90MB/s. However, the throughput for sequential scan is still around 4MB/s.





--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

Initially temp_buffers was left at its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?
Here is the explain you requested with work_mem set to 6GB:
flows=# set work_mem='6GB';
SET
flows=# explain (analyze, buffers) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes
FROM
   flowscompact,
   mynetworks
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr
   AND dstaddr IN
   (
      SELECT
         dstaddr
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                           QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)
   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)
         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 654395kB
               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 3049216kB
                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)
                                 Workers Planned: 12
                                 Workers Launched: 12
                                 Buffers: shared hit=728798037 read=82974833
                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=728795193 read=82974833
                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)
                                             Buffers: shared hit=590692229 read=14991777
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)
                                                   Buffers: shared hit=860990 read=14991777
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 0
                                                   Buffers: shared hit=589831203
                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB
                                             Buffers: shared hit=138102964 read=67983056
                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=138102964 read=67983056
                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 503353617
                                                         Buffers: shared hit=138102964 read=67983056
                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)
                                                               Buffers: shared hit=138102915 read=67983056
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)
                                                               Buffers: shared hit=13
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)
                                                                     Buffers: shared hit=13
 Planning time: 0.941 ms
 Execution time: 2228345.171 ms
(48 rows)

With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query. 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (having about 590 million rows), keeping only the most often used fields.
flows=# \d+ flowscompact;
                        View "public.flowscompact"
  Column   |           Type           | Modifiers | Storage | Description 
-----------+--------------------------+-----------+---------+-------------
 flow_id   | bigint                   |           | plain   | 
 sysuptime | bigint                   |           | plain   | 
 exaddr    | ip4                      |           | plain   | 
 dpkts     | integer                  |           | plain   | 
 doctets   | bigint                   |           | plain   | 
 first     | bigint                   |           | plain   | 
 last      | bigint                   |           | plain   | 
 srcaddr   | ip4                      |           | plain   | 
 dstaddr   | ip4                      |           | plain   | 
 srcport   | integer                  |           | plain   | 
 dstport   | integer                  |           | plain   | 
 prot      | smallint                 |           | plain   | 
 tos       | smallint                 |           | plain   | 
 tcp_flags | smallint                 |           | plain   | 
 timestamp | timestamp with time zone |           | plain   | 
View definition:
 SELECT flowstimestamp.flow_id,
    flowstimestamp.sysuptime,
    flowstimestamp.exaddr,
    flowstimestamp.dpkts,
    flowstimestamp.doctets,
    flowstimestamp.first,
    flowstimestamp.last,
    flowstimestamp.srcaddr,
    flowstimestamp.dstaddr,
    flowstimestamp.srcport,
    flowstimestamp.dstport,
    flowstimestamp.prot,
    flowstimestamp.tos,
    flowstimestamp.tcp_flags,
    flowstimestamp."timestamp"
   FROM flowstimestamp;
mynetworks is a table having one column and 4 rows; it contains a list of our networks:
flows=# select * from mynetworks;
     ipaddr     
----------------
(4 rows)
flows=# \d+ mynetworks;
                    Table "public.mynetworks"
 Column | Type | Modifiers | Storage | Stats target | Description 
--------+------+-----------+---------+--------------+-------------
 ipaddr | ip4r |           | plain   |              | 
Indexes:
    "mynetworks_ipaddr_idx" gist (ipaddr)
dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.
flows=# \d+ dstexterne;
              View "public.dstexterne"
 Column  | Type | Modifiers | Storage | Description 
---------+------+-----------+---------+-------------
 dstaddr | ip4  |           | plain   | 
View definition:
 SELECT DISTINCT flowscompact.dstaddr
   FROM flowscompact
     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r
  WHERE mynetworks.ipaddr IS NULL;
Thanks!

Charles

On Wed, Jul 12, 2017 at 6:39 PM, Igor Neyman <ineyman@perceptron.com> wrote:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Charles Nadeau
Sent: Wednesday, July 12, 2017 6:05 AM
To: Jeff Janes <jeff.janes@gmail.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

 

flows=# explain (analyze, buffers) SELECT DISTINCT

flows-#    srcaddr,

flows-#    dstaddr,

flows-#    dstport,

flows-#    COUNT(*) AS conversation,

flows-#    SUM(doctets) / 1024 / 1024 AS mbytes 

flows-# FROM

flows-#    flowscompact,

flows-#    mynetworks 

flows-# WHERE

flows-#    mynetworks.ipaddr >>= flowscompact.srcaddr 

flows-#    AND dstaddr IN 

flows-#    (

flows(#       SELECT

flows(#          dstaddr 

flows(#       FROM

flows(#          dstexterne

flows(#    )

flows-# GROUP BY

flows-#    srcaddr,

flows-#    dstaddr,

flows-#    dstport 

flows-# ORDER BY

flows-#    mbytes DESC LIMIT 50;

LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.6", size 1073741824

LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.7", size 1073741824

LOG:  temporary file: path "pg_tblspc/36238/PG_9.6_201608131/pgsql_tmp/pgsql_tmp14573.8", size 639696896

LOG:  duration: 2765020.327 ms  statement: explain (analyze, buffers) SELECT DISTINCT

   srcaddr,

   dstaddr,

   dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes 

FROM

   flowscompact,

   mynetworks 

WHERE

   mynetworks.ipaddr >>= flowscompact.srcaddr 

   AND dstaddr IN 

   (

      SELECT

         dstaddr 

      FROM

         dstexterne

   )

GROUP BY

   srcaddr,

   dstaddr,

   dstport 

ORDER BY

   mbytes DESC LIMIT 50;

                                                                                                              QUERY PLAN                                                                                                              

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=37762321.83..37762321.98 rows=50 width=52) (actual time=2764548.863..2764548.891 rows=50 loops=1)

   Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

   I/O Timings: read=5323746.860

   ->  Unique  (cost=37762321.83..37769053.57 rows=2243913 width=52) (actual time=2764548.861..2764548.882 rows=50 loops=1)

         Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

         I/O Timings: read=5323746.860

         ->  Sort  (cost=37762321.83..37763443.79 rows=2243913 width=52) (actual time=2764548.859..2764548.872 rows=50 loops=1)

               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))

               Sort Method: quicksort  Memory: 563150kB

               Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

               I/O Timings: read=5323746.860

               ->  GroupAggregate  (cost=37698151.34..37714980.68 rows=2243913 width=52) (actual time=2696721.610..2752109.551 rows=4691734 loops=1)

                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport

                     Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

                     I/O Timings: read=5323746.860

                     ->  Sort  (cost=37698151.34..37699273.29 rows=2243913 width=20) (actual time=2696711.428..2732781.705 rows=81896988 loops=1)

                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport

                           Sort Method: external merge  Disk: 2721856kB

                           Buffers: shared hit=1116590560 read=15851133, temp read=340244 written=340244

                           I/O Timings: read=5323746.860

                           ->  Gather  (cost=19463936.00..37650810.19 rows=2243913 width=20) (actual time=1777219.713..2590530.887 rows=81896988 loops=1)

                                 Workers Planned: 9

                                 Workers Launched: 9

                                 Buffers: shared hit=1116590559 read=15851133

                                 I/O Timings: read=5323746.860

                                 ->  Hash Semi Join  (cost=19462936.00..37622883.23 rows=249324 width=20) (actual time=1847579.360..2602039.780 rows=8189699 loops=10)

                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)

                                       Buffers: shared hit=1116588309 read=15851133

                                       I/O Timings: read=5323746.860

                                       ->  Nested Loop  (cost=0.03..18159012.30 rows=249324 width=20) (actual time=1.562..736556.583 rows=45499045 loops=10)

                                             Buffers: shared hit=996551813 read=15851133

                                             I/O Timings: read=5323746.860

                                             ->  Parallel Seq Scan on flows  (cost=0.00..16039759.79 rows=62330930 width=20) (actual time=1.506..547485.066 rows=54155970 loops=10)

                                                   Buffers: shared hit=1634 read=15851133

                                                   I/O Timings: read=5323746.860

                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=541559704)

                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)

                                                   Heap Fetches: 59971474

                                                   Buffers: shared hit=996550152

                                       ->  Hash  (cost=19462896.74..19462896.74 rows=11210 width=4) (actual time=1847228.894..1847228.894 rows=3099798 loops=10)

                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 141746kB

                                             Buffers: shared hit=120036496

                                             ->  HashAggregate  (cost=19462829.48..19462863.11 rows=11210 width=4) (actual time=1230049.015..1845955.764 rows=3099798 loops=10)

                                                   Group Key: flows_1.dstaddr

                                                   Buffers: shared hit=120036496

                                                   ->  Nested Loop Anti Join  (cost=0.12..19182620.78 rows=560417390 width=4) (actual time=0.084..831832.333 rows=113420172 loops=10)

                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)

                                                         Rows Removed by Join Filter: 453681377

                                                         Buffers: shared hit=120036496

                                                         ->  Index Only Scan using flows_srcaddr_dstaddr_idx on flows flows_1  (cost=0.12..9091067.70 rows=560978368 width=4) (actual time=0.027..113052.437 rows=541559704 loops=10)

                                                               Heap Fetches: 91

                                                               Buffers: shared hit=120036459

                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=5415597040)

                                                               Buffers: shared hit=10

                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.007..0.008 rows=4 loops=10)

                                                                     Buffers: shared hit=10

 Planning time: 6.689 ms

 Execution time: 2764860.853 ms

(58 rows)

 

Regarding "Also using dstat I can see that iowait time is at about 25%", I don't think the server was doing anything else. If it is important, I can repeat the benchmarks.

Thanks!

 

Charles

 

Charles,

 

In your original posting I couldn’t find what value you set for temp_buffers.

Considering you have plenty of RAM, try setting temp_buffers='6GB' and then run 'explain (analyze, buffers) select…' in the same session. This should alleviate the 'disk sort' problem.

 

Also, could you post the structure of the flowscompact, mynetworks, and dstexterne tables, with all their indexes and row counts? Actually, are they all tables, or are some of them views?

 

Igor

 

 

Sorry, I misstated the parameter to change.

It is work_mem (not temp_buffers) you should try to increase to 6GB.

 

Igor

 

 




--

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

Igor,

 

Initially temp_buffers was left at its default value (8MB). Watching the contents of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?

Here is the explain you requested with work_mem set to 6GB:

flows=# set work_mem='6GB';

SET

flows=# explain (analyze, buffers) SELECT DISTINCT

   srcaddr,

   dstaddr,

   dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes

FROM

   flowscompact,

   mynetworks

WHERE

   mynetworks.ipaddr >>= flowscompact.srcaddr

   AND dstaddr IN

   (

      SELECT

         dstaddr

      FROM

         dstexterne

   )

GROUP BY

   srcaddr,

   dstaddr,

   dstport

ORDER BY

   mbytes DESC LIMIT 50;

                                                                                           QUERY PLAN                                                                                           

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)

   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)

         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)

               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))

               Sort Method: quicksort  Memory: 654395kB

               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)

                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport

                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)

                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport

                           Sort Method: external merge  Disk: 3049216kB

                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154

                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)

                                 Workers Planned: 12

                                 Workers Launched: 12

                                 Buffers: shared hit=728798037 read=82974833

                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)

                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)

                                       Buffers: shared hit=728795193 read=82974833

                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)

                                             Buffers: shared hit=590692229 read=14991777

                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)

                                                   Buffers: shared hit=860990 read=14991777

                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)

                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)

                                                   Heap Fetches: 0

                                                   Buffers: shared hit=589831203

                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)

                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB

                                             Buffers: shared hit=138102964 read=67983056

                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)

                                                   Group Key: flows_1.dstaddr

                                                   Buffers: shared hit=138102964 read=67983056

                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)

                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)

                                                         Rows Removed by Join Filter: 503353617

                                                         Buffers: shared hit=138102964 read=67983056

                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)

                                                               Buffers: shared hit=138102915 read=67983056

                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)

                                                               Buffers: shared hit=13

                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)

                                                                     Buffers: shared hit=13

 Planning time: 0.941 ms

 Execution time: 2228345.171 ms

(48 rows)

 

With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query.

 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (about 590 million rows); it keeps only the most often used fields.

flows=# \d+ flowscompact;

                        View "public.flowscompact"

  Column   |           Type           | Modifiers | Storage | Description 

-----------+--------------------------+-----------+---------+-------------

 flow_id   | bigint                   |           | plain   | 

 sysuptime | bigint                   |           | plain   | 

 exaddr    | ip4                      |           | plain   | 

 dpkts     | integer                  |           | plain   | 

 doctets   | bigint                   |           | plain   | 

 first     | bigint                   |           | plain   | 

 last      | bigint                   |           | plain   | 

 srcaddr   | ip4                      |           | plain   | 

 dstaddr   | ip4                      |           | plain   | 

 srcport   | integer                  |           | plain   | 

 dstport   | integer                  |           | plain   | 

 prot      | smallint                 |           | plain   | 

 tos       | smallint                 |           | plain   | 

 tcp_flags | smallint                 |           | plain   | 

 timestamp | timestamp with time zone |           | plain   | 

View definition:

 SELECT flowstimestamp.flow_id,

    flowstimestamp.sysuptime,

    flowstimestamp.exaddr,

    flowstimestamp.dpkts,

    flowstimestamp.doctets,

    flowstimestamp.first,

    flowstimestamp.last,

    flowstimestamp.srcaddr,

    flowstimestamp.dstaddr,

    flowstimestamp.srcport,

    flowstimestamp.dstport,

    flowstimestamp.prot,

    flowstimestamp.tos,

    flowstimestamp.tcp_flags,

    flowstimestamp."timestamp"

   FROM flowstimestamp;

mynetworks is a table having one column and 4 rows; it contains a list of our networks:

flows=# select * from mynetworks;

     ipaddr     

----------------

(4 rows)

flows=# \d+ mynetworks;

                    Table "public.mynetworks"

 Column | Type | Modifiers | Storage | Stats target | Description 

--------+------+-----------+---------+--------------+-------------

 ipaddr | ip4r |           | plain   |              | 

Indexes:

    "mynetworks_ipaddr_idx" gist (ipaddr)

dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.

flows=# \d+ dstexterne;

              View "public.dstexterne"

 Column  | Type | Modifiers | Storage | Description 

---------+------+-----------+---------+-------------

 dstaddr | ip4  |           | plain   | 

View definition:

 SELECT DISTINCT flowscompact.dstaddr

   FROM flowscompact

     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r

  WHERE mynetworks.ipaddr IS NULL;

Thanks!

 

Charles

 

Charles,

 

Don’t change temp_buffers.

Try to increase work_mem even more, say work_mem='12GB', because it's still using disk for sorting (starting around the 20th minute, as you noticed).

See if this:

“Sort Method: external merge  Disk: 3049216kB”

goes away.
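For illustration, a minimal self-contained sketch (not your query) of how the sort method reported by EXPLAIN changes with work_mem; the table-free generate_series query is just an example:

-- Small work_mem: the sort spills to disk.
SET work_mem = '1MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;
-- expect something like:  Sort Method: external merge  Disk: ...

-- Larger work_mem: the same sort stays in memory.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g DESC;
-- expect something like:  Sort Method: quicksort  Memory: ...

RESET work_mem;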

Igor

 

Re: [PERFORM] Very poor read performance, query independent

From
Igor Neyman
Date:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Friday, July 14, 2017 3:13 PM
To: Charles Nadeau <charles.nadeau@gmail.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 


Charles,

 

Also, let’s try to simplify your query and see if it performs better.

You are grouping by srcaddr, dstaddr, dstport, which makes DISTINCT unnecessary (see the small example after the rewritten query below).

And after simplifying the WHERE clause (let me know if the result is not what you want), the query looks like:

 

SELECT srcaddr, dstaddr, dstport,

   COUNT(*) AS conversation,

   SUM(doctets) / 1024 / 1024 AS mbytes

FROM flowscompact

WHERE srcaddr IN (SELECT ipaddr FROM mynetworks)

     AND dstaddr NOT IN (SELECT ipaddr FROM mynetworks)

GROUP BY srcaddr, dstaddr, dstport

ORDER BY mbytes DESC

LIMIT 50;
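To illustrate the DISTINCT point above with a tiny, self-contained example (made-up data, not your tables):

-- The GROUP BY keys are already unique per output row, so DISTINCT only adds
-- an extra pass over the result; both statements return the same two rows.
SELECT DISTINCT a, count(*) FROM (VALUES (1), (1), (2)) AS t(a) GROUP BY a;
SELECT          a, count(*) FROM (VALUES (1), (1), (2)) AS t(a) GROUP BY a;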

 

Now, you didn’t provide the definition of the flowstimestamp table.

If this table doesn’t have an index on (srcaddr, dstaddr, dstport), creating one should help (I think).
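A sketch of what that index could look like (the index name is illustrative, and since flowscompact is a view, the index has to be created on the underlying table, assumed here to be flows):

-- CONCURRENTLY avoids blocking writes on flows while the index is built.
CREATE INDEX CONCURRENTLY IF NOT EXISTS flows_srcaddr_dstaddr_dstport_idx
    ON flows (srcaddr, dstaddr, dstport);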

 

Igor

 

 

 

Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Ah yes - that seems more sensible (but still slower than I would expect
for 5-disk RAID 0). You should be able to get something like 5 *
(single disk speed), i.e. about 500MB/s.

Might be worth increasing device read ahead (more than you have
already). Some of these so-called 'smart' RAID cards need to be hit over
the head before they will perform. E.g.: I believe you have it set to 128
- I'd try 4096 or even 16384 (in the past I've used those settings on
some extremely stupid cards that refused to max out their disks' known
speeds).

Also worth investigating is RAID stripe size - for DW work it makes
sense for it to be reasonably big (256K to 1M), which again will help
speed up sequential scans.

Cheers

Mark


On 15/07/17 02:09, Charles Nadeau wrote:
> Mark,
>
> First I must say that I changed my disks configuration from 4 disks in
> RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space
> during the last ingest of data.
> Here is the result of the test you asked for. It was done with a cold cache:
>
>     flows=# \timing
>     Timing is on.
>     flows=# explain select count(*) from flows;
>                                               QUERY PLAN
>     -----------------------------------------------------------------------------------------------
>      Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
>        ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
>              Workers Planned: 1
>              ->  Partial Aggregate  (cost=17213914.07..17213914.07
>     rows=1 width=8)
>                    ->  Parallel Seq Scan on flows
>      (cost=0.00..17019464.49 rows=388899162 width=0)
>     (5 rows)
>
>     Time: 171.835 ms
>     flows=# select pg_relation_size('flows');
>      pg_relation_size
>     ------------------
>          129865867264
>     (1 row)
>
>     Time: 57.157 ms
>     flows=# select count(*) from flows;
>     LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
>        count
>     -----------
>      589831190
>     (1 row)
>
>     Time: 625546.662 ms
>
> The throughput reported by Postgresql is almost 198MB/s, and the
> throughput as measured by dstat during the query execution was between
> 25 and 299MB/s. It is much better than what I had before! The I/O wait
> was about 12% all through the query. One thing I noticed is the
> discrepancy between the read throughput reported by pg_activity and
> the one reported by dstat: pg_activity always reports a value lower
> than dstat.
>
> Besides the change of disk configuration, here is what contributed
> the most to the improvement of the performance so far:
>
>     Using Hugepage
>     Increasing effective_io_concurrency to 256
>     Reducing random_page_cost from 22 to 4
>     Reducing min_parallel_relation_size to 512kB to have more workers
>     when doing sequential parallel scan of my biggest table
>
>
> Thanks for recommending this test, I now know what the real throughput
> should be!
>
> Charles
>
> On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
>
>     Hmm - how are you measuring that sequential scan speed of 4MB/s?
>     I'd recommend doing a very simple test e.g, here's one on my
>     workstation - 13 GB single table on 1 SATA drive - cold cache
>     after reboot, sequential scan using Postgres 9.6.2:
>
>     bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
>                                          QUERY PLAN
>     ------------------------------------------------------------------------------------
>      Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
>        ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00
>     rows=100000000 width=0)
>     (2 rows)
>
>
>     bench=#  SELECT pg_relation_size('pgbench_accounts');
>      pg_relation_size
>     ------------------
>           13429514240
>     (1 row)
>
>     bench=# SELECT count(*) FROM pgbench_accounts;
>        count
>     -----------
>      100000000
>     (1 row)
>
>     Time: 118884.277 ms
>
>
>     So doing the math seq read speed is about 110MB/s (i.e 13 GB in
>     120 sec). Sure enough, while I was running the query iostat showed:
>
>     Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s wMB/s
>     avgrq-sz avgqu-sz   await r_await w_await  svctm %util
>     sda               0.00     0.00  926.00    0.00 114.89  0.00
>      254.10     1.90    2.03    2.03    0.00   1.08 100.00
>
>
>     So might be useful for us to see something like that from your
>     system - note you need to check you really have flushed the cache,
>     and that no other apps are using the db.
>
>     regards
>
>     Mark
>
>
>     On 12/07/17 00:46, Charles Nadeau wrote:
>
>         After reducing random_page_cost to 4 and testing more, I can
>         report that the aggregate read throughput for parallel
>         sequential scan is about 90MB/s. However the throughput for
>         sequential scan is still around 4MB/s.
>
>
>
>
>
> --
> Charles Nadeau Ph.D.
> http://charlesnadeau.blogspot.com/




Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Thinking about this a bit more - if somewhat more blazing performance is
needed, then this could be achieved by losing the RAID card and
spinning disks altogether and buying one of the NVMe or SATA solid state
products, e.g.:

- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and
200K IOPS)

- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)


The Samsung needs an M.2 port on the mobo (but most should have 'em -
and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit
more expensive compared to the Samsung, and is slower but has a longer
lifetime. However for your workload the Sammy is probably fine.

regards

Mark

On 15/07/17 11:09, Mark Kirkwood wrote:
> Ah yes - that seems more sensible (but still slower than I would
> expect for 5 disks RAID 0).



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

The server is an HP DL 380 G6. It doesn't really work with SATA drives. And when you find one that is compatible, it is only used at 3Gb/s with a maximum of 50000 IOPS (a well-known characteristic of the HP P410i SAS RAID controller). I am looking at getting a Kingston Digital HyperX Predator that I could use in one of the PCIe 2.0 x4 slots. However, I am worried about "thermal runaway", i.e. when the server can't get a temperature reading from a PCIe card, it spins the fans at full speed to protect the server against high temperature. With the machine being next to my desk, I worry about the deafening noise it will create.
Thanks!

Charles

On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Thinking about this a bit more - if somewhat more blazing performance is needed, then this could be achieved via losing the RAID card and spinning disks altogether and buying 1 of the NVME or SATA solid state products: e.g

- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and 200K IOPS)

- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)


The Samsung needs an M.2 port on the mobo (but most should have 'em - and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit more expensive compared to the Samsung, and is slower but has a longer lifetime. However for your workload the Sammy is probably fine.

regards

Mark


On 15/07/17 11:09, Mark Kirkwood wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0).






--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

I increased the read ahead to 16384 and it doesn't improve performance. My RAID 0 uses a stripe size of 256k, the maximum size supported by the controller.
Thanks!

Charles

On Sat, Jul 15, 2017 at 1:02 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5-disk RAID 0). You should be able to get something like 5 * (single disk speed), i.e. about 500MB/s.

Might be worth increasing device read ahead (more than you have already). Some of these so-called 'smart' RAID cards need to be hit over the head before they will perform. E.g.: I believe you have it set to 128 - I'd try 4096 or even 16384 (in the past I've used those settings on some extremely stupid cards that refused to max out their disks' known speeds).

Also worth investigating is RAID stripe size - for DW work it makes sense for it to be reasonably big (256K to 1M), which again will help speed up sequential scans.

Cheers

Mark

P.S. I used to work for Greenplum, so this type of problem came up a lot :-). The best cards were the LSI and Areca!



On 15/07/17 02:09, Charles Nadeau wrote:
Mark,

First I must say that I changed my disks configuration from 4 disks in RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during the last ingest of data.
Here is the result of the test you asked for. It was done with a cold cache:

    flows=# \timing
    Timing is on.
    flows=# explain select count(*) from flows;
                                              QUERY PLAN
    -----------------------------------------------------------------------------------------------
     Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
       ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
             Workers Planned: 1
             ->  Partial Aggregate  (cost=17213914.07..17213914.07
    rows=1 width=8)
                   ->  Parallel Seq Scan on flows
     (cost=0.00..17019464.49 rows=388899162 width=0)
    (5 rows)

    Time: 171.835 ms
    flows=# select pg_relation_size('flows');
     pg_relation_size
    ------------------
         129865867264
    (1 row)

    Time: 57.157 ms
    flows=# select count(*) from flows;
    LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
       count
    -----------
     589831190
    (1 row)

    Time: 625546.662 ms

The throughput reported by Postgresql is almost 198MB/s, and the throughput as measured by dstat during the query execution was between 25 and 299MB/s. It is much better than what I had before! The I/O wait was about 12% all through the query. One thing I noticed is the discrepancy between the read throughput reported by pg_activity and the one reported by dstat: pg_activity always reports a value lower than dstat.

Besides the change of disk configuration, here is what contributed the most to the improvement of the performance so far:

    Using Hugepage
    Increasing effective_io_concurrency to 256
    Reducing random_page_cost from 22 to 4
    Reducing min_parallel_relation_size to 512kB to have more workers
    when doing sequential parallel scan of my biggest table


Thanks for recommending this test, I now know what the real throughput should be!

Charles

On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>> wrote:

    Hmm - how are you measuring that sequential scan speed of 4MB/s?
    I'd recommend doing a very simple test e.g, here's one on my
    workstation - 13 GB single table on 1 SATA drive - cold cache
    after reboot, sequential scan using Postgres 9.6.2:

    bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
                                         QUERY PLAN
    ------------------------------------------------------------------------------------
     Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
       ->  Seq Scan on pgbench_accounts (cost=0.00..2639345.00
    rows=100000000 width=0)
    (2 rows)


    bench=#  SELECT pg_relation_size('pgbench_accounts');
     pg_relation_size
    ------------------
          13429514240
    (1 row)

    bench=# SELECT count(*) FROM pgbench_accounts;
       count
    -----------
     100000000
    (1 row)

    Time: 118884.277 ms


    So doing the math seq read speed is about 110MB/s (i.e 13 GB in
    120 sec). Sure enough, while I was running the query iostat showed:

    Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s wMB/s
    avgrq-sz avgqu-sz   await r_await w_await  svctm %util
    sda               0.00     0.00  926.00    0.00 114.89  0.00      254.10     1.90    2.03    2.03    0.00   1.08 100.00


    So might be useful for us to see something like that from your
    system - note you need to check you really have flushed the cache,
    and that no other apps are using the db.

    regards

    Mark


    On 12/07/17 00:46, Charles Nadeau wrote:

        After reducing random_page_cost to 4 and testing more, I can
        report that the aggregate read throughput for parallel
        sequential scan is about 90MB/s. However the throughput for
        sequential scan is still around 4MB/s.





--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/




--

Re: [PERFORM] Very poor read performance, query independent

From
Scott Marlowe
Date:
On Sat, Jul 15, 2017 at 11:53 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Mark,
>
> I increased the read ahead to 16384 and it doesn't improve performance. My
> RAID 0 use a stripe size of 256k, the maximum size supported by the
> controller.

Are your queries still spilling to disk for sorts? If this is the
case, and they're just too big to fit in memory, then you need to move
your temp space, where sorts happen, onto another disk array that
isn't your poor overworked raid-10 array. Contention between sorts and
reads can kill performance quick, esp on spinning rust.
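A minimal sketch of one way to do that (the tablespace name and path are illustrative; the directory must already exist, be empty, and be owned by the postgres OS user):

-- Create a tablespace on the separate device and make it the default
-- location for temporary files and temporary tables.
CREATE TABLESPACE temp_space LOCATION '/mnt/tempdisk/pg_temp';
ALTER SYSTEM SET temp_tablespaces = 'temp_space';
SELECT pg_reload_conf();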


Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Right, that is a bit of a show stopper for those SSDs (the Intel needs
SATA 6Gb/s and the Sammys need PCIe 3.0 to perform to their rated specs).

regards

Mark


On 16/07/17 04:12, Charles Nadeau wrote:
> Mark,
>
> The server is a . It doesn't really work with SATA drives. And when
> you find one that is compatible, it is only used at 3Gb/s with a
> maximum of 50000 IOPS (a well-known characteristic of the HP P410i SAS
> RAID controller). I am looking at getting a Kingston Digital HyperX
> Predator that I could use in one of the PCIe 2.0 x4 slot. However I am
> worried about the "thermal runaway", i.e. when the server can't get a
> temperature reading from a PCIe card, it spins the fans at full speed
> to protect the server against high temperature. The machine being next
> to my desk I worry about the deafening noise it will create.
> Thanks!
>
> Charles
>
> On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood
> <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
> wrote:
>
>     Thinking about this a bit more - if somewhat more blazing
>     performance is needed, then this could be achieved via losing the
>     RAID card and spinning disks altogether and buying 1 of the NVME
>     or SATA solid state products: e.g
>
>     - Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds
>     and 200K IOPS)
>
>     - Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)
>
>
>     The Samsung needs an M.2 port on the mobo (but most should have
>     'em - and if not PCIe X4 adapter cards are quite cheap). The Intel
>     is a bit more expensive compared to the Samsung, and is slower but
>     has a longer lifetime. However for your workload the Sammy is
>     probably fine.
>
>     regards
>
>     Mark
>
>     On 15/07/17 11:09, Mark Kirkwood wrote:
>
>         Ah yes - that seems more sensible (but still slower than I
>         would expect for 5 disks RAID 0).
>
>
>
>
>
>
>
>
> --
> Charles Nadeau Ph.D.
> http://charlesnadeau.blogspot.com/



Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

I set work_mem to 12GB, restarted postgresql, repeated the same "explain (analyze, buffers)..." as above, and the read throughput was very low, at most 10MB/s. All the sorting operations are now done in memory.
I lowered work_mem back to 6GB, restarted postgresql, repeated the same "explain (analyze, buffers)..." as above, and the read throughput was very low, at most 10MB/s. The 1st sorting operation is still done in memory, the second one on disk. I think I need about 4GB to do all the sorts in memory.
One thing I remember from Friday's "explain (analyze, buffers)...": I had set temp_buffers and work_mem to 6GB, as I read both your messages one after the other. So I decided to try again: I set work_mem=6GB and temp_buffers=6GB, restarted postgresql, repeated the same "explain (analyze, buffers)..." as above, and the read throughput was very low again, at most 10MB/s. The 1st sorting operation is still done in memory, the second one on disk.
For the last test, I brought work_mem and temp_buffers back to their original values. The read throughput is still low.
In all cases, about 20 minutes after the query starts, it starts writing to disk furiously. Here are the peak values as reported by dstat:
work_mem=12GB, temp_buffers=8MB: peak of 393MB/s
work_mem=6GB, temp_buffers=8MB: peak of 579MB/s
work_mem=6GB, temp_buffers=6GB: peak of 418MB/s
work_mem=1468006kB, temp_buffers=8MB: peak of 61MB/s
Also, at peak write, the server almost ran out of memory: the cache almost went to 0 and it started swapping.

This query is a bit extreme in terms of sorting. Maybe I should try to benchmark while counting all the records of my biggest table like Mark Kirkwood suggested. I'll do some more tests and post the results back to the mailing list.
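As an aside, a minimal sketch of the per-session form used earlier in the thread, which avoids restarting the server between trials (the value is just an example):

SET work_mem = '12GB';   -- affects only the current session
SHOW work_mem;
-- run the EXPLAIN (ANALYZE, BUFFERS) query here, then:
RESET work_mem;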
Thanks!

Charles

On Fri, Jul 14, 2017 at 9:13 PM, Igor Neyman <ineyman@perceptron.com> wrote:


Charles,

 

Don’t change temp_buffers.

Try to increase work_mem even more, say work_mem=’12GB’, because it’s still using disk for sorting (starting around 20th minute as you noticed).

See if this:

“Sort Method: external merge  Disk: 3049216kB”

goes away.

Igor

 




--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Scott,

The temp tablespace is on a disk of its own.
Thanks!

Charles

On Sat, Jul 15, 2017 at 7:58 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Sat, Jul 15, 2017 at 11:53 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Mark,
>
> I increased the read ahead to 16384 and it doesn't improve performance. My
> RAID 0 use a stripe size of 256k, the maximum size supported by the
> controller.

Are your queries still spilling to disk for sorts? If this is the
case, and they're just too big to fit in memory, then you need to move
your temp space, where sorts happen, onto another disk array that
isn't your poor overworked raid-10 array. Contention between sorts and
reads can kill performance quick, esp on spinning rust.



--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Igor,

The first clause of the WHERE statement won't select addresses the same way as the one I wrote using the extension for IPv4 and IPv6 data types.

flowstimestamp is a view:
flows=# \d+ flowstimestamp 
                        View "public.flowstimestamp"
   Column    |           Type           | Modifiers | Storage | Description 
-------------+--------------------------+-----------+---------+-------------
 flow_id     | bigint                   |           | plain   | 
 unix_secs   | bigint                   |           | plain   | 
 unix_nsecs  | bigint                   |           | plain   | 
 sysuptime   | bigint                   |           | plain   | 
 exaddr      | ip4                      |           | plain   | 
 dpkts       | integer                  |           | plain   | 
 doctets     | bigint                   |           | plain   | 
 first       | bigint                   |           | plain   | 
 last        | bigint                   |           | plain   | 
 engine_type | smallint                 |           | plain   | 
 engine_id   | smallint                 |           | plain   | 
 srcaddr     | ip4                      |           | plain   | 
 dstaddr     | ip4                      |           | plain   | 
 nexthop     | ip4                      |           | plain   | 
 input       | integer                  |           | plain   | 
 output      | integer                  |           | plain   | 
 srcport     | integer                  |           | plain   | 
 dstport     | integer                  |           | plain   | 
 prot        | smallint                 |           | plain   | 
 tos         | smallint                 |           | plain   | 
 tcp_flags   | smallint                 |           | plain   | 
 src_mask    | smallint                 |           | plain   | 
 dst_mask    | smallint                 |           | plain   | 
 src_as      | integer                  |           | plain   | 
 dst_as      | integer                  |           | plain   | 
 timestamp   | timestamp with time zone |           | plain   | 
View definition:
 SELECT flows.flow_id,
    flows.unix_secs,
    flows.unix_nsecs,
    flows.sysuptime,
    flows.exaddr,
    flows.dpkts,
    flows.doctets,
    flows.first,
    flows.last,
    flows.engine_type,
    flows.engine_id,
    flows.srcaddr,
    flows.dstaddr,
    flows.nexthop,
    flows.input,
    flows.output,
    flows.srcport,
    flows.dstport,
    flows.prot,
    flows.tos,
    flows.tcp_flags,
    flows.src_mask,
    flows.dst_mask,
    flows.src_as,
    flows.dst_as,
    to_timestamp((flows.unix_secs + flows.unix_nsecs / 1000000000)::double precision) AS "timestamp"
   FROM flows;

And it can use the indexes of flows:
Indexes:
    "flows_pkey" PRIMARY KEY, btree (flow_id)
    "flows_dstaddr_dstport" btree (dstaddr, dstport)
    "flows_srcaddr_dstaddr_idx" btree (srcaddr, dstaddr)
    "flows_srcaddr_srcport" btree (srcaddr, srcport)
    "flows_srcport_dstport_idx" btree (srcport, dstport)

Thanks!

Charles

On Fri, Jul 14, 2017 at 10:18 PM, Igor Neyman <ineyman@perceptron.com> wrote:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Friday, July 14, 2017 3:13 PM
To: Charles Nadeau <charles.nadeau@gmail.com>


Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

From: Charles Nadeau [mailto:charles.nadeau@gmail.com]
Sent: Friday, July 14, 2017 11:35 AM
To: Igor Neyman <ineyman@perceptron.com>
Cc: Jeff Janes <jeff.janes@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Very poor read performance, query independent

 

Igor,

 

Initially temp_buffers was left at its default value (8MB). Watching the content of the directory that stores the temporary files, I found that I need at most 21GB of temporary file space. Should I set temp_buffers to 21GB?

Here is the explain you requested with work_mem set to 6GB:

flows=# set work_mem='6GB';
SET
flows=# explain (analyze, buffers) SELECT DISTINCT
   srcaddr,
   dstaddr,
   dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes
FROM
   flowscompact,
   mynetworks
WHERE
   mynetworks.ipaddr >>= flowscompact.srcaddr
   AND dstaddr IN
   (
      SELECT
         dstaddr
      FROM
         dstexterne
   )
GROUP BY
   srcaddr,
   dstaddr,
   dstport
ORDER BY
   mbytes DESC LIMIT 50;
                                                                                           QUERY PLAN                                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48135680.07..48135680.22 rows=50 width=52) (actual time=2227678.196..2227678.223 rows=50 loops=1)
   Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
   ->  Unique  (cost=48135680.07..48143613.62 rows=2644514 width=52) (actual time=2227678.194..2227678.217 rows=50 loops=1)
         Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
         ->  Sort  (cost=48135680.07..48137002.33 rows=2644514 width=52) (actual time=2227678.192..2227678.202 rows=50 loops=1)
               Sort Key: (((sum(flows.doctets) / '1024'::numeric) / '1024'::numeric)) DESC, flows.srcaddr, flows.dstaddr, flows.dstport, (count(*))
               Sort Method: quicksort  Memory: 654395kB
               Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
               ->  GroupAggregate  (cost=48059426.65..48079260.50 rows=2644514 width=52) (actual time=2167909.030..2211446.192 rows=5859671 loops=1)
                     Group Key: flows.srcaddr, flows.dstaddr, flows.dstport
                     Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                     ->  Sort  (cost=48059426.65..48060748.90 rows=2644514 width=20) (actual time=2167896.815..2189107.205 rows=91745640 loops=1)
                           Sort Key: flows.srcaddr, flows.dstaddr, flows.dstport
                           Sort Method: external merge  Disk: 3049216kB
                           Buffers: shared hit=728798038 read=82974833, temp read=381154 written=381154
                           ->  Gather  (cost=30060688.07..48003007.07 rows=2644514 width=20) (actual time=1268989.000..1991357.232 rows=91745640 loops=1)
                                 Workers Planned: 12
                                 Workers Launched: 12
                                 Buffers: shared hit=728798037 read=82974833
                                 ->  Hash Semi Join  (cost=30059688.07..47951761.31 rows=220376 width=20) (actual time=1268845.181..2007864.725 rows=7057357 loops=13)
                                       Hash Cond: (flows.dstaddr = flows_1.dstaddr)
                                       Buffers: shared hit=728795193 read=82974833
                                       ->  Nested Loop  (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283 rows=37910370 loops=13)
                                             Buffers: shared hit=590692229 read=14991777
                                             ->  Parallel Seq Scan on flows  (cost=0.00..16018049.14 rows=55094048 width=20) (actual time=0.152..566179.117 rows=45371630 loops=13)
                                                   Buffers: shared hit=860990 read=14991777
                                             ->  Index Only Scan using mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=589831190)
                                                   Index Cond: (ipaddr >>= (flows.srcaddr)::ip4r)
                                                   Heap Fetches: 0
                                                   Buffers: shared hit=589831203
                                       ->  Hash  (cost=30059641.47..30059641.47 rows=13305 width=4) (actual time=1268811.101..1268811.101 rows=3803508 loops=13)
                                             Buckets: 4194304 (originally 16384)  Batches: 1 (originally 1)  Memory Usage: 166486kB
                                             Buffers: shared hit=138102964 read=67983056
                                             ->  HashAggregate  (cost=30059561.64..30059601.56 rows=13305 width=4) (actual time=1265248.165..1267432.083 rows=3803508 loops=13)
                                                   Group Key: flows_1.dstaddr
                                                   Buffers: shared hit=138102964 read=67983056
                                                   ->  Nested Loop Anti Join  (cost=0.00..29729327.92 rows=660467447 width=4) (actual time=0.389..1201072.707 rows=125838232 loops=13)
                                                         Join Filter: (mynetworks_1.ipaddr >> (flows_1.dstaddr)::ip4r)
                                                         Rows Removed by Join Filter: 503353617
                                                         Buffers: shared hit=138102964 read=67983056
                                                         ->  Seq Scan on flows flows_1  (cost=0.00..17836152.73 rows=661128576 width=4) (actual time=0.322..343152.274 rows=589831190 loops=13)
                                                               Buffers: shared hit=138102915 read=67983056
                                                         ->  Materialize  (cost=0.00..1.02 rows=4 width=8) (actual time=0.000..0.000 rows=2 loops=7667805470)
                                                               Buffers: shared hit=13
                                                               ->  Seq Scan on mynetworks mynetworks_1  (cost=0.00..1.01 rows=4 width=8) (actual time=0.006..0.007 rows=4 loops=13)
                                                                     Buffers: shared hit=13
 Planning time: 0.941 ms
 Execution time: 2228345.171 ms
(48 rows)

 

With work_mem at 6GB, I noticed that for the first 20 minutes the query was running, the I/O wait was much lower, hovering around 3%; then it jumped to 45% until almost the end of the query. 

 

flowscompact and dstexterne are actually views. I use views to simplify query writing and to "abstract" queries that are used often in other queries. flowscompact is a view built on the table flows (about 590 million rows); it only keeps the most often used fields.

flows=# \d+ flowscompact;
                        View "public.flowscompact"
  Column   |           Type           | Modifiers | Storage | Description 
-----------+--------------------------+-----------+---------+-------------
 flow_id   | bigint                   |           | plain   | 
 sysuptime | bigint                   |           | plain   | 
 exaddr    | ip4                      |           | plain   | 
 dpkts     | integer                  |           | plain   | 
 doctets   | bigint                   |           | plain   | 
 first     | bigint                   |           | plain   | 
 last      | bigint                   |           | plain   | 
 srcaddr   | ip4                      |           | plain   | 
 dstaddr   | ip4                      |           | plain   | 
 srcport   | integer                  |           | plain   | 
 dstport   | integer                  |           | plain   | 
 prot      | smallint                 |           | plain   | 
 tos       | smallint                 |           | plain   | 
 tcp_flags | smallint                 |           | plain   | 
 timestamp | timestamp with time zone |           | plain   | 
View definition:
 SELECT flowstimestamp.flow_id,
    flowstimestamp.sysuptime,
    flowstimestamp.exaddr,
    flowstimestamp.dpkts,
    flowstimestamp.doctets,
    flowstimestamp.first,
    flowstimestamp.last,
    flowstimestamp.srcaddr,
    flowstimestamp.dstaddr,
    flowstimestamp.srcport,
    flowstimestamp.dstport,
    flowstimestamp.prot,
    flowstimestamp.tos,
    flowstimestamp.tcp_flags,
    flowstimestamp."timestamp"
   FROM flowstimestamp;

mynetworks is a table having one column and 4 rows; it contains a list of our networks:

flows=# select * from mynetworks;
     ipaddr     
----------------
(4 rows)

flows=# \d+ mynetworks;
                    Table "public.mynetworks"
 Column | Type | Modifiers | Storage | Stats target | Description 
--------+------+-----------+---------+--------------+-------------
 ipaddr | ip4r |           | plain   |              | 
Indexes:
    "mynetworks_ipaddr_idx" gist (ipaddr)

dstexterne is a view listing all the destination IPv4 addresses not inside our network; it has one column and 3.8 million rows.

flows=# \d+ dstexterne;
              View "public.dstexterne"
 Column  | Type | Modifiers | Storage | Description 
---------+------+-----------+---------+-------------
 dstaddr | ip4  |           | plain   | 
View definition:
 SELECT DISTINCT flowscompact.dstaddr
   FROM flowscompact
     LEFT JOIN mynetworks ON mynetworks.ipaddr >> flowscompact.dstaddr::ip4r
  WHERE mynetworks.ipaddr IS NULL;

Thanks!

 

Charles

 

Charles,

 

Also, let’s try to simplify your query and see if it performs better.

You are grouping by srcaddr, dstaddr, dstport, so DISTINCT is not needed.

And after simplifying the WHERE clause (let me know if the result is not what you want), the query looks like:

 

SELECT srcaddr, dstaddr, dstport,
   COUNT(*) AS conversation,
   SUM(doctets) / 1024 / 1024 AS mbytes
FROM flowscompact
WHERE srcaddr IN (SELECT ipaddr FROM mynetworks)
     AND dstaddr NOT IN (SELECT ipaddr FROM mynetworks)
GROUP BY srcaddr, dstaddr, dstport
ORDER BY mbytes DESC
LIMIT 50;

 

Now, you didn’t provide the definition of flowstimestamp table.

If this table doesn’t have an index on (srcaddr, dstaddr, dstport) creating one should help (I think).

 

Igor

 

 

 




--

Re: [PERFORM] Very poor read performance, query independent

From
Claudio Freire
Date:
On Fri, Jul 14, 2017 at 12:34 PM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
>                                  Workers Planned: 12
>                                  Workers Launched: 12
>                                  Buffers: shared hit=728798037 read=82974833
>                                  ->  Hash Semi Join
> (cost=30059688.07..47951761.31 rows=220376 width=20) (actual
> time=1268845.181..2007864.725 rows=7057357 loops=13)
>                                        Hash Cond: (flows.dstaddr =
> flows_1.dstaddr)
>                                        Buffers: shared hit=728795193
> read=82974833
>                                        ->  Nested Loop
> (cost=0.03..17891246.86 rows=220376 width=20) (actual time=0.207..723790.283
> rows=37910370 loops=13)
>                                              Buffers: shared hit=590692229
> read=14991777
>                                              ->  Parallel Seq Scan on flows
> (cost=0.00..16018049.14 rows=55094048 width=20) (actual
> time=0.152..566179.117 rows=45371630 loops=13)
>                                                    Buffers: shared
> hit=860990 read=14991777
>                                              ->  Index Only Scan using
> mynetworks_ipaddr_idx on mynetworks  (cost=0.03..0.03 rows=1 width=8)
> (actual time=0.002..0.002 rows=1 loops=589831190)
>                                                    Index Cond: (ipaddr >>=
> (flows.srcaddr)::ip4r)
>                                                    Heap Fetches: 0
>                                                    Buffers: shared
> hit=589831203

12 workers on a parallel sequential scan on a RAID-10 volume of
rotating disks may not be a good idea.

Have you measured average request size and average wait times with iostat?

Run "iostat -x -m -d 60" while running the query and copy a few
relevant lines (or attach the whole thing). I suspect 12 parallel
sequential scans are degrading your array's performance to random I/O
performance, and that explains the 10MB/s very well (a rotating disk
will give you about 3-4MB/s at random I/O, and you've got 2 mirrors on
that array).

You could try setting the max_parallel_workers_per_gather to 2, which
should be the optimum allocation for your I/O layout.
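
One way to apply that (a sketch; this GUC only needs a reload, not a restart):

psql -d flows <<'SQL'
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
SELECT pg_reload_conf();   -- new value is picked up by subsequent statements/sessions
SQL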

You might also want to test switching to the deadline scheduler. While
the controller may get more aggregate throughput rearranging your I/O
requests, high I/O latency will severely reduce postgres' ability to
saturate the I/O system itself, and deadline tends to minimize
latency. I've had good results in the past using deadline, but take
this suggestion with a grain of salt, YMMV.
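
Switching the scheduler can be tried at runtime along these lines (sdc is the data array described earlier in the thread; persisting the change needs a udev rule or a kernel boot parameter):

cat /sys/block/sdc/queue/scheduler                        # current scheduler is shown in brackets
echo deadline | sudo tee /sys/block/sdc/queue/scheduler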


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Claudio,

Find attached the iostat measured while redoing the query above (iostat1.txt). sda holds my temp directory (noop i/o scheduler), sdb holds only the swap partition (cfq i/o scheduler), and sdc (5 disks RAID0, noop i/o scheduler) holds the data. I didn't pay attention to the load caused by 12 parallel scans as I thought the RAID card would be smart enough to re-arrange the read requests optimally regardless of the load. At one moment during the query, there is a write storm to the swap drive (a bit like this case: https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com). I can hardly explain it as there is plenty of memory on this server. The execution time of the query was 4801.1s (about 1h20min).
I reduced max_parallel_workers_per_gather to 2 and max_parallel_workers to 3, restarted postgresql, then ran the query again while running iostat (iostat2.txt): the query ran much faster, 1992.8s (about 33min) instead of 4801.1s (about 1h20min), and the swap storm is gone! You were right about max_parallel_workers_per_gather!!
For the last test, I changed the scheduler on sdc to deadline (iostat3.txt), keeping max_parallel_workers_per_gather=2 and max_parallel_workers=3, then restarted postgresql. The execution time is almost the same: 1938.7s vs 1992.8s for the noop scheduler.

Thanks a lot for the suggestion, I'll keep my number of workers low to make sure I maximize my array usage.

Charles




--
Attachment

Re: [PERFORM] Very poor read performance, query independent

From
Claudio Freire
Date:
On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Claudio,
>
> At one moment
> during the query, there is a write storm to the swap drive (a bit like this
> case:
> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> I can hardly explain it as there is plenty of memory on this server.

That sounds a lot like NUMA zone_reclaim issues:

https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com
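
A quick way to check the current setting and the per-node memory situation (assuming numactl is installed):

cat /proc/sys/vm/zone_reclaim_mode   # 0 = zone reclaim disabled
numactl --hardware                   # free memory per NUMA node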


Re: [PERFORM] Very poor read performance, query independent

From
Claudio Freire
Date:
On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> <charles.nadeau@gmail.com> wrote:
>> Claudio,
>>
>> At one moment
>> during the query, there is a write storm to the swap drive (a bit like this
>> case:
>> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
>> I can hardly explain it as there is plenty of memory on this server.
>
> That sounds a lot like NUMA zone_reclaim issues:
>
> https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com

I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
eerily similar.


Re: [PERFORM] Very poor read performance, query independent

From
Justin Pryzby
Date:
On Tue, Jul 18, 2017 at 02:13:58PM -0300, Claudio Freire wrote:
> On Tue, Jul 18, 2017 at 1:01 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > On Tue, Jul 18, 2017 at 6:20 AM, Charles Nadeau
> > <charles.nadeau@gmail.com> wrote:
> >> Claudio,
> >>
> >> At one moment
> >> during the query, there is a write storm to the swap drive (a bit like this
> >> case:
> >> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).
> >> I can hardly explain it as there is plenty of memory on this server.
> >
> > That sounds a lot like NUMA zone_reclaim issues:
> >
> > https://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com
>
> I realize you have zone_reclaim_mode set to 0. Still, the symptoms are
> eerily similar.

Did you look at disabling KSM and/or THP ?

sudo sh -c 'echo 2 >/sys/kernel/mm/ksm/run'
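
The THP counterpart, in the same style (the usual sysfs paths; persist it via rc.local or a kernel boot parameter):

sudo sh -c 'echo never >/sys/kernel/mm/transparent_hugepage/enabled'
sudo sh -c 'echo never >/sys/kernel/mm/transparent_hugepage/defrag'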

https://www.postgresql.org/message-id/20170524155855.GH31097%40telsasoft.com
https://www.postgresql.org/message-id/CANQNgOrD02f8mR3Y8Pi=zFsoL14RqNQA8hwz1r4rSnDLr1b2Cw@mail.gmail.com
https://www.postgresql.org/message-id/CAHyXU0y9hviyKWvQZxX5UWfH9M2LYvwvAOPQ_DUPva2b71t12g%40mail.gmail.com
https://www.postgresql.org/message-id/20130716195834.8fe5c79249cb2ff0d4270b3e@yahoo.es
https://www.postgresql.org/message-id/CAE_gQfW3dBiELcOppYN6v%3D8%2B%2BpEeywD7iXGw-OT3doB8SXO4_A%40mail.gmail.com

https://www.postgresql.org/message-id/flat/1436268563235-5856914.post%40n5.nabble.com#1436268563235-5856914.post@n5.nabble.com
https://www.postgresql.org/message-id/CAL_0b1tJOZCx3Lo3Eve1RqGaT%2BJJ_Q7w4pkJ87WfWwXbTugnxw@mail.gmail.com
https://www.postgresql.org/message-id/556E2068.7070007@vuole.me
https://www.postgresql.org/message-id/1415981309.90631.YahooMailNeo%40web133205.mail.ir2.yahoo.com
https://www.postgresql.org/message-id/CAHyXU0yXYpCXN4%3D81ZDRQu-oGzrcq2qNAXDpyz4oiQPPAGk4ew%40mail.gmail.com
https://www.pythian.com/blog/performance-tuning-hugepages-in-linux/
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/

Justin


Re: [PERFORM] Very poor read performance, query independent

From
Scott Marlowe
Date:
On Tue, Jul 18, 2017 at 3:20 AM, Charles Nadeau
<charles.nadeau@gmail.com> wrote:
> Claudio,
>
> Find attached the iostat measured while redoing the query above
> (iostat1.txt). sda holds my temp directory (noop i/o scheduler), sdb the
> swap partition (cfq i/o scheduler) only and sdc (5 disks RAID0, noop i/o
> scheduler) holds the data. I didn't pay attention to the load caused by 12
> parallel scans as I thought the RAID card would be smart enough to
> re-arrange the read requests optimally regardless of the load. At one moment
> during the query, there is a write storm to the swap drive (a bit like this
> case:
> https://www.postgresql.org/message-id/AANLkTi%3Diw4fC2RgTxhw0aGpyXANhOT%3DXBnjLU1_v6PdA%40mail.gmail.com).

My experience from that case (and a few more) has led me to believe
that Linux database servers with plenty of memory should have their
swaps turned off. The Linux kernel works hard to swap out little used
memory to make more space for caching active data.

Problem is that whatever decides to swap stuff out gets stupid when
presented with 512GB RAM and starts swapping out things like sys v
shared_buffers etc.

Here's the thing, either your memory is big enough to buffer your
whole data set, so nothing should get swapped out to make room for
caching.

OR your dataset is much bigger than memory. In which case, making more
room gets very little if it comes at the cost of waiting for stuff you
need to get read back in.

Linux servers should also have zone reclaim turned off, and THP disabled.

Try running "sudo swapoff -a" and see if it gets rid of your swap storms.


Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Justin,

Thanks for the extensive reading list, very educative.

After reading https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ I was thinking that it could be a NUMA/THP-related problem.
Turning off THP solved the "swap storm" problem. Some queries are even 40% faster with THP off.
Then turning off KSM as well improved performance by another 5%.
I was seriously worried about this issue as we received today another server with 144GB of RAM.

I will try to post a little summary of all the suggestions I received via this thread later this week/early next week.

Thanks!

Charles




--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
Mark,

I received yesterday a second server having 16 drive bays. Just for a quick trial, I used 2 old 60GB SSDs (a Kingston V300 and an ADATA SP900) to build a RAID0. To my surprise, my very picky RAID controller (HP P410i) recognised them without a fuss (although as SATA II drives at 3Gb/s). A quick fio benchmark gives me 22000 random 4k read IOPS, more than my 5 146GB 10k SAS disks in RAID0. I moved my most frequently used index to this array and will try to do some benchmarks.
Knowing that SSDs based on the SandForce-2281 controller are recognised by my server, I may buy a pair of bigger/newer ones to put my tables on.
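
The fio run was roughly of this shape (the exact job parameters aren't in the thread, so the values below are assumptions, and the device name is hypothetical):

fio --name=randread --filename=/dev/sdd --readonly --direct=1 \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=32 \
    --runtime=60 --time_based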

Thanks!

Charles

On Sat, Jul 15, 2017 at 1:57 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Thinking about this a bit more - if somewhat more blazing performance is needed, then this could be achieved via losing the RAID card and spinning disks altogether and buying 1 of the NVME or SATA solid state products: e.g

- Samsung 960 Pro or Evo 2 TB (approx 1 or 2 GB/s seq scan speeds and 200K IOPS)

- Intel S3610 or similar 1.2 TB (500 MB/s seq scan and 30K IOPS)


The Samsung needs an M.2 port on the mobo (but most should have 'em - and if not PCIe X4 adapter cards are quite cheap). The Intel is a bit more expensive compared to the Samsung, and is slower but has a longer lifetime. However for your workload the Sammy is probably fine.

regards

Mark


On 15/07/17 11:09, Mark Kirkwood wrote:
Ah yes - that seems more sensible (but still slower than I would expect for 5 disks RAID 0).



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--

Re: [PERFORM] Very poor read performance, query independent

From
Charles Nadeau
Date:
All,

Here is a list of what I did based on the suggestions made after my initial post (a sketch of the corresponding commands follows the list):
*Reduce max_parallel_workers to 4: Higher values make the workers wait for data as the RAID0 array can't deliver high enough IOPS.
*Reduce random_page_cost to 1: Forcing the use of indexes makes queries faster despite the low random throughput.
*Increase shared_buffers to 66GB and effective_cache_size to 53GB: With the new server having 144GB of RAM, increasing shared_buffers allows Postgresql to keep a lot of data in memory, reducing the need to go to disk.
*Reduce min_parallel_relation_size to 512kB to have more workers when doing parallel sequential scans.
*Increase the /sys/block/sd[ac]/queue/read_ahead_kb to 16384 for my arrays using HDDs.
*Reuse old SSDs (that are compatible with my RAID controller, to my surprise) to hold my most used indexes and tables.
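
A rough sketch of the changes above (assuming the parallel-worker item refers to max_parallel_workers_per_gather on 9.6; shared_buffers needs a restart to take effect):

psql -d flows <<'SQL'
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
ALTER SYSTEM SET random_page_cost = 1;
ALTER SYSTEM SET shared_buffers = '66GB';              -- requires a server restart
ALTER SYSTEM SET effective_cache_size = '53GB';
ALTER SYSTEM SET min_parallel_relation_size = '512kB';
SQL
sudo sh -c 'echo 16384 >/sys/block/sda/queue/read_ahead_kb'
sudo sh -c 'echo 16384 >/sys/block/sdc/queue/read_ahead_kb'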

Thanks to everybody who made suggestions. I now know more about Postgresql tuning.

Charles



--

Re: [PERFORM] Very poor read performance, query independent

From
Mark Kirkwood
Date:
Nice!

Pleased that the general idea worked well for you!

I'm also relieved that you did not follow my recommendation exactly -
I've been trialling a Samsung 960 Evo (256GB) and Intel 600p (256GB) and
I've stumbled across the serious disadvantages of (consumer) M.2 drives
using TLC NAND - terrible sustained write performance! While these guys
can happily do ~ 2GB/s reads, their write performance is only 'burst
capable'. They have small SLC NAND 'write caches' that do ~1GB/s for a
*limited time* (10-20s) and after that you get ~ 200 MB/s! Ouch - my old
Crucial 550 can do 350 MB/s sustained writes (so two of them in RAID0
are doing 700 MB/s for hours).

Bigger capacity drives can do better - but overall I'm not that
impressed with the current trend of using TLC NAND.

regards

Mark

